MapReduce Service User Guide. Issue 08. Release date: 2020-03-18. Huawei Technologies Co., Ltd.

  • MapReduce Service

    User Guide

    Issue 08

    Release date: 2020-03-18

    Huawei Technologies Co., Ltd.

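  • Copyright © Huawei Technologies Co., Ltd. 2020. All rights reserved.

    No part of this document may be excerpted, reproduced, or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

    Trademark Notice

    [logo] and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and registered trademarks mentioned in this document are the property of their respective holders.

    Notice

    The products, services, and features you purchase are subject to the commercial contract and terms of Huawei. All or part of the products, services, and features described in this document may not be within the scope of your purchase or use. Unless otherwise specified in the contract, Huawei makes no representations or warranties of any kind, express or implied, regarding the contents of this document.

    This document is updated from time to time because of product version upgrades or other reasons. Unless otherwise agreed, this document serves only as a usage guide, and the statements, information, and recommendations in it do not constitute a warranty of any kind, express or implied.

    Issue 08 (2020-03-18) Copyright © Huawei Technologies Co., Ltd.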

  • Contents

    1 IAM Permissions Management
      1.1 Creating a User and Granting MRS Permissions
      1.2 MRS Custom Policies
      1.3 Synchronizing IAM Users to MRS

    2 Getting Started
      2.1 How to Use MRS
      2.2 Creating a Cluster
      2.3 Uploading Sample Data and Programs
      2.4 Adding a Job
      2.5 Deleting a Cluster

    3 Configuring a Cluster
      3.1 Overview
      3.2 Cluster List
      3.3 Purchase Methods
      3.4 Quickly Purchasing a Hadoop Analysis Cluster
      3.5 Quickly Purchasing an HBase Analysis Cluster
      3.6 Quickly Purchasing a Kafka Streaming Cluster
      3.7 Purchasing a Custom Cluster
      3.8 Creating a Minimum-Specification Cluster
      3.9 Creating an MRS Cluster on Dedicated Cloud
      3.10 Configuring a Cluster with Decoupled Storage and Compute
      3.11 Adding Cluster Tags
      3.12 Installing Third-Party Software Using Bootstrap Actions
        3.12.1 Introduction to Bootstrap Actions
        3.12.2 Preparing a Bootstrap Action Script
        3.12.3 Viewing Execution Records
        3.12.4 Adding a Bootstrap Action
        3.12.5 Sample Scripts

    4 Managing an Existing Cluster
      4.1 Viewing and Monitoring a Cluster
        4.1.1 Viewing Basic Cluster Information
        4.1.2 Viewing Cluster Patch Information

        4.1.3 Viewing and Customizing Cluster Monitoring Metrics
        4.1.4 Managing Component and Host Monitoring
      4.2 Scaling Out a Cluster
      4.3 Scaling In a Cluster
      4.4 Unsubscribing from Specified Nodes in a Yearly/Monthly Cluster
      4.5 Configuring Auto Scaling Rules
      4.6 Configuring Auto Scaling Rules When Creating a Cluster
      4.7 Upgrading Master Node Specifications
      4.8 Configuring Message Notification
      4.9 O&M
        4.9.1 O&M Authorization
        4.9.2 Log Sharing
      4.10 Deleting a Cluster
      4.11 Unsubscribing from a Cluster
      4.12 Deleting Failed Tasks
      4.13 Job Management
        4.13.1 Introduction to MRS Jobs
        4.13.2 Running a MapReduce Job
        4.13.3 Running a Spark Job
        4.13.4 Running a HiveSql Job
        4.13.5 Running a SparkSql Job
        4.13.6 Running a Flink Job
        4.13.7 Running a Kafka Job
        4.13.8 Viewing Job Configurations and Logs
        4.13.9 Stopping a Job
        4.13.10 Copying a Job
        4.13.11 Deleting a Job
        4.13.12 Running Jobs with OBS-Encrypted Data
        4.13.13 Configuring Job Message Notification
      4.14 Managing Data Files
      4.15 Component Management
        4.15.1 Introduction to Object Management
        4.15.2 Viewing Configurations
        4.15.3 Managing Service Operations
        4.15.4 Configuring Service Parameters
        4.15.5 Configuring Custom Service Parameters
        4.15.6 Synchronizing Service Configurations
        4.15.7 Managing Role Instance Operations
        4.15.8 Configuring Role Instance Parameters
        4.15.9 Synchronizing Role Instance Configurations
        4.15.10 Decommissioning and Recommissioning Role Instances
        4.15.11 Managing Host (Node) Operations


        4.15.12 Isolating a Host
        4.15.13 Canceling Host Isolation
        4.15.14 Starting and Stopping a Cluster
        4.15.15 Synchronizing Cluster Configurations
        4.15.16 Exporting Cluster Configuration Data
        4.15.17 Rolling Restart Support
      4.16 Alarm Management
        4.16.1 Viewing the Alarm List
        4.16.2 Viewing and Manually Clearing Alarms
      4.17 Alarm Reference
        4.17.1 ALM-12001 Audit Log Dump Failure
        4.17.2 ALM-12002 HA Resource Abnormal
        4.17.3 ALM-12004 OLdap Resource Abnormal
        4.17.4 ALM-12005 OKerberos Resource Abnormal
        4.17.5 ALM-12006 Node Fault
        4.17.6 ALM-12007 Process Fault
        4.17.7 ALM-12010 Heartbeat Interrupted Between Active and Standby Manager Nodes
        4.17.8 ALM-12011 Data Synchronization Abnormal Between Active and Standby Manager Nodes
        4.17.9 ALM-12012 NTP Service Abnormal
        4.17.10 ALM-12016 CPU Usage Exceeds the Threshold
        4.17.11 ALM-12017 Insufficient Disk Capacity
        4.17.12 ALM-12018 Memory Usage Exceeds the Threshold
        4.17.13 ALM-12027 Host PID Usage Exceeds the Threshold
        4.17.14 ALM-12028 Number of Processes in the D State on a Host Exceeds the Threshold
        4.17.15 ALM-12031 User omm or Its Password Is About to Expire
        4.17.16 ALM-12032 User ommdba or Its Password Is About to Expire
        4.17.17 ALM-12033 Slow Disk Fault
        4.17.18 ALM-12034 Periodic Backup Task Failure
        4.17.19 ALM-12035 Data Status Unknown After a Failed Restoration
        4.17.20 ALM-12037 NTP Server Abnormal
        4.17.21 ALM-12038 Monitoring Metric Dump Failure
        4.17.22 ALM-12039 GaussDB Active/Standby Data Not Synchronized
        4.17.23 ALM-12040 Insufficient System Entropy
        4.17.24 ALM-13000 ZooKeeper Service Unavailable
        4.17.25 ALM-13001 Insufficient Available ZooKeeper Connections
        4.17.26 ALM-13002 ZooKeeper Memory Usage Exceeds the Threshold
        4.17.27 ALM-14000 HDFS Service Unavailable
        4.17.28 ALM-14001 HDFS Disk Space Usage Exceeds the Threshold
        4.17.29 ALM-14002 DataNode Disk Space Usage Exceeds the Threshold
        4.17.30 ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold
        4.17.31 ALM-14004 Number of Corrupted HDFS Blocks Exceeds the Threshold
        4.17.32 ALM-14006 Number of HDFS Files Exceeds the Threshold


        4.17.33 ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold
        4.17.34 ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold
        4.17.35 ALM-14009 Number of Faulty DataNodes Exceeds the Threshold
        4.17.36 ALM-14010 NameService Service Abnormal
        4.17.37 ALM-14011 HDFS DataNode Data Directory Improperly Configured
        4.17.38 ALM-14012 HDFS Journalnode Data Not Synchronized
        4.17.39 ALM-16000 Percentage of Sessions Connected to HiveServer to the Maximum Allowed Exceeds the Threshold
        4.17.40 ALM-16001 Hive Data Warehouse Space Usage Exceeds the Threshold
        4.17.41 ALM-16002 Hive SQL Execution Success Rate Is Lower Than the Threshold
        4.17.42 ALM-16004 Hive Service Unavailable
        4.17.43 ALM-18000 Yarn Service Unavailable
        4.17.44 ALM-18002 NodeManager Heartbeat Lost
        4.17.45 ALM-18003 NodeManager Unhealthy
        4.17.46 ALM-18006 MapReduce Task Execution Timed Out
        4.17.47 ALM-19000 HBase Service Unavailable
        4.17.48 ALM-19006 HBase Disaster Recovery Synchronization Failure
        4.17.49 ALM-25000 LdapServer Service Unavailable
        4.17.50 ALM-25004 LdapServer Data Synchronization Abnormal
        4.17.51 ALM-25500 KrbServer Service Unavailable
        4.17.52 ALM-27001 DBService Service Unavailable
        4.17.53 ALM-27003 Heartbeat Interrupted Between Active and Standby DBService Nodes
        4.17.54 ALM-27004 DBService Active/Standby Data Not Synchronized
        4.17.55 ALM-28001 Spark Service Unavailable
        4.17.56 ALM-26051 Storm Service Unavailable
        4.17.57 ALM-26052 Number of Available Supervisors of the Storm Service Is Less Than the Threshold
        4.17.58 ALM-26053 Storm Slot Usage Exceeds the Threshold
        4.17.59 ALM-26054 Storm Nimbus Heap Memory Usage Exceeds the Threshold
        4.17.60 ALM-38000 Kafka Service Unavailable
        4.17.61 ALM-38001 Insufficient Kafka Disk Capacity
        4.17.62 ALM-38002 Kafka Heap Memory Usage Exceeds the Threshold
        4.17.63 ALM-24000 Flume Service Unavailable
        4.17.64 ALM-24001 Flume Agent Abnormal
        4.17.65 ALM-24003 Flume Client Connection Interrupted
        4.17.66 ALM-24004 Flume Data Read Abnormal
        4.17.67 ALM-24005 Flume Data Transmission Abnormal
        4.17.68 ALM-12041 Key File Permission Abnormal
        4.17.69 ALM-12042 Key File Configuration Abnormal
        4.17.70 ALM-23001 Loader Service Unavailable
        4.17.71 ALM-12357 Failed to Export Audit Logs to OBS
        4.17.72 ALM-12014 Device Partition Lost
        4.17.73 ALM-12015 Device Partition File System Read-Only
        4.17.74 ALM-12043 DNS Resolution Duration Exceeds the Threshold


        4.17.75 ALM-12045 Network Read Packet Loss Rate Exceeds the Threshold
        4.17.76 ALM-12046 Network Write Packet Loss Rate Exceeds the Threshold
        4.17.77 ALM-12047 Network Read Packet Error Rate Exceeds the Threshold
        4.17.78 ALM-12048 Network Write Packet Error Rate Exceeds the Threshold
        4.17.79 ALM-12049 Network Read Throughput Exceeds the Threshold
        4.17.80 ALM-12050 Network Write Throughput Exceeds the Threshold
        4.17.81 ALM-12051 Disk Inode Usage Exceeds the Threshold
        4.17.82 ALM-12052 TCP Ephemeral Port Usage Exceeds the Threshold
        4.17.83 ALM-12053 File Handle Usage Exceeds the Threshold
        4.17.84 ALM-12054 Certificate File Invalid
        4.17.85 ALM-12055 Certificate File About to Expire
        4.17.86 ALM-18008 Yarn ResourceManager Heap Memory Usage Exceeds the Threshold
        4.17.87 ALM-18009 MapReduce JobHistoryServer Heap Memory Usage Exceeds the Threshold
        4.17.88 ALM-20002 Hue Service Unavailable
        4.17.89 ALM-43001 Spark Service Unavailable
        4.17.90 ALM-43006 JobHistory Process Heap Memory Usage Exceeds the Threshold
        4.17.91 ALM-43007 JobHistory Process Non-Heap Memory Usage Exceeds the Threshold
        4.17.92 ALM-43008 JobHistory Process Direct Memory Usage Exceeds the Threshold
        4.17.93 ALM-43009 JobHistory GC Time Exceeds the Threshold
        4.17.94 ALM-43010 JDBCServer Process Heap Memory Usage Exceeds the Threshold
        4.17.95 ALM-43011 JDBCServer Process Non-Heap Memory Usage Exceeds the Threshold
        4.17.96 ALM-43012 JDBCServer Process Direct Memory Usage Exceeds the Threshold
        4.17.97 ALM-43013 JDBCServer GC Time Exceeds the Threshold
        4.17.98 ALM-44004 Queued Tasks in Presto Coordinator Resource Groups Exceed the Threshold
        4.17.99 ALM-44005 Presto Coordinator Process Garbage Collection Time Exceeds the Threshold
        4.17.100 ALM-44006 Presto Worker Process Garbage Collection Time Exceeds the Threshold
        4.17.101 ALM-18010 Number of Pending Yarn Tasks Exceeds the Threshold
        4.17.102 ALM-18011 Memory of Pending Yarn Tasks Exceeds the Threshold
        4.17.103 ALM-18012 Number of Yarn Tasks Terminated in the Previous Period Exceeds the Threshold
        4.17.104 ALM-18013 Number of Yarn Tasks That Failed in the Previous Period Exceeds the Threshold
        4.17.105 ALM-16005 Hive SQL Execution Failures in the Previous Period Exceed the Threshold
      4.18 Patch Management
        4.18.1 Patch Operation Guide for Versions Earlier Than MRS 1.7.0
        4.18.2 Patch Operation Guide for Versions Later Than MRS 1.7.0
        4.18.3 Rolling Patches
        4.18.4 Restoring Patches for Isolated Hosts
      4.19 MRS Patch Description
        4.19.1 MRS 1.5.1.4 Patch Description
        4.19.2 MRS 1.7.1.3 Patch Description
        4.19.3 MRS 1.7.1.5 Patch Description
        4.19.4 MRS 1.7.1.6 Patch Description
        4.19.5 MRS 1.8.10.1 Patch Description


  • 4.19.6 MRS 1.8.10.6 Patch Description
    4.19.7 MRS 2.0.1.1 Patch Description
    4.19.8 MRS 2.0.1.2 Patch Description
    4.19.9 MRS 2.0.1.3 Patch Description
    4.19.10 MRS 2.0.6.1 Patch Description
    4.19.11 MRS 2.1.0.1 Patch Description
    4.19.12 MRS 2.1.0.2 Patch Description
    4.19.13 MRS 2.1.0.3 Patch Description
    4.19.14 MRS 2.1.0.5 Patch Description
    4.19.15 MRS 2.1.0.6 Patch Description
    4.19.16 MRS 2.1.0.7 Patch Description
    4.20 Log Management
    4.20.1 About Logs
    4.20.2 Manager Log List
    4.20.3 Viewing and Exporting Audit Logs
    4.20.4 Exporting Service Logs
    4.20.5 Configuring Audit Log Export Parameters
    4.21 Health Check Management
    4.21.1 Performing a Health Check
    4.21.2 Viewing and Exporting a Health Check Report
    4.21.3 DBService Health Check Indicators
    4.21.4 Flume Health Check Indicators
    4.21.5 HBase Health Check Indicators
    4.21.6 Host Health Check Indicators
    4.21.7 HDFS Health Check Indicators
    4.21.8 Hive Health Check Indicators
    4.21.9 Kafka Health Check Indicators
    4.21.10 KrbServer Health Check Indicators
    4.21.11 LdapServer Health Check Indicators
    4.21.12 Loader Health Check Indicators
    4.21.13 MapReduce Health Check Indicators
    4.21.14 OMS Health Check Indicators
    4.21.15 Spark Health Check Indicators
    4.21.16 Storm Health Check Indicators
    4.21.17 Yarn Health Check Indicators
    4.21.18 ZooKeeper Health Check Indicators
    4.22 Tenant Management
    4.22.1 Tenant Overview
    4.22.2 Adding a Tenant
    4.22.3 Adding a Sub-Tenant
    4.22.4 Deleting a Tenant
    4.22.5 Managing Tenant Directories


  • 4.22.6 Restoring Tenant Data
    4.22.7 Adding a Resource Pool
    4.22.8 Modifying a Resource Pool
    4.22.9 Deleting a Resource Pool
    4.22.10 Configuring a Queue
    4.22.11 Configuring the Queue Capacity Policy of a Resource Pool
    4.22.12 Clearing Queue Configuration
    4.23 Backup and Restoration
    4.23.1 Introduction to Backup and Restoration
    4.23.2 Backing Up Metadata
    4.23.3 Restoring Metadata
    4.23.4 Modifying a Backup Task
    4.23.5 Viewing Backup and Restoration Tasks
    4.24 Security Management
    4.24.1 Default Users of Clusters with Kerberos Authentication Disabled
    4.24.2 Default Users of Clusters with Kerberos Authentication Enabled
    4.24.3 Changing the Password of an OS User
    4.24.4 Changing the Password of User admin
    4.24.5 Changing the Password of the Kerberos Administrator
    4.24.6 Changing the Passwords of the LDAP Administrator and the LDAP User
    4.24.7 Changing the Password of a Component Running User
    4.24.8 Changing the Password of the OMS Database Administrator
    4.24.9 Changing the Password of the Data Access User of the OMS Database
    4.24.10 Changing the Password of a Component Database User
    4.24.11 Replacing the HA Certificate
    4.24.12 Updating Cluster Keys
    4.25 MRS Multi-User Permission Management
    4.25.1 Users and Permissions of MRS Clusters
    4.25.2 Default Users of Clusters with Kerberos Authentication Enabled
    4.25.3 Creating a Role
    4.25.4 Creating a User Group
    4.25.5 Creating a User
    4.25.6 Modifying User Information
    4.25.7 Locking a User
    4.25.8 Unlocking a User
    4.25.9 Deleting a User
    4.25.10 Changing the Password of an Operation User
    4.25.11 Initializing the Password of a System User
    4.25.12 Downloading a User Authentication File
    4.25.13 Modifying a Password Policy
    4.25.14 Configuring Cross-Cluster Mutual Trust
    4.25.15 Configuring and Using Users of Mutually Trusted Clusters


  • 4.25.16 Configuring Fine-Grained Permissions for MRS Multi-User Access to OBS

    5 Managing Historical Clusters
    5.1 Viewing Basic Information About a Historical Cluster

    6 Viewing Operation Logs

    7 Managing Data Connections

    8 Connecting to Clusters
    8.1 Logging In to a Cluster
    8.1.1 Cluster Node Overview
    8.1.2 Logging In to a Cluster Node
    8.1.3 Determining the Active and Standby Management Nodes of MRS Manager
    8.2 Using an MRS Client
    8.2.1 Using an MRS Client on a Node Inside the Cluster
    8.2.2 Using an MRS Client on a Node Outside the Cluster
    8.2.3 Updating a Client
    8.3 Accessing Web Pages of Open Source Components Hosted on MRS Clusters
    8.3.1 Web Sites of Open Source Components
    8.3.2 List of Open Source Component Ports
    8.3.3 Access Through an Elastic Public IP Address
    8.3.4 Access Through a Windows Elastic Cloud Server
    8.3.5 Creating an SSH Tunnel for Connecting to an MRS Cluster and Configuring the Browser

    9 MRS Manager Operation Guide
    9.1 Introduction to MRS Manager
    9.2 Accessing MRS Manager
    9.3 Accessing Manager That Supports Kerberos Authentication
    9.4 Viewing Running Tasks in a Cluster
    9.5 Monitoring Management
    9.5.1 System Overview
    9.5.2 Managing Service and Host Monitoring
    9.5.3 Managing Resource Distribution
    9.5.4 Configuring Monitoring Metric Dumping
    9.6 Alarm Management
    9.6.1 Viewing and Manually Clearing Alarms
    9.6.2 Configuring Monitoring and Alarm Thresholds
    9.6.3 Configuring Syslog Northbound Parameters
    9.6.4 Configuring SNMP Northbound Parameters
    9.7 Alarm Reference
    9.7.1 ALM-12001 Audit Log Dump Failure
    9.7.2 ALM-12002 HA Resource Is Abnormal
    9.7.3 ALM-12004 OLdap Resource Is Abnormal
    9.7.4 ALM-12005 OKerberos Resource Is Abnormal


  • 9.7.5 ALM-12006 Node Fault
    9.7.6 ALM-12007 Process Fault
    9.7.7 ALM-12010 Heartbeat Between the Active and Standby Manager Nodes Is Interrupted
    9.7.8 ALM-12011 Data Synchronization Between the Active and Standby Manager Nodes Is Abnormal
    9.7.9 ALM-12012 NTP Service Is Abnormal
    9.7.10 ALM-12016 CPU Usage Exceeds the Threshold
    9.7.11 ALM-12017 Insufficient Disk Capacity
    9.7.12 ALM-12018 Memory Usage Exceeds the Threshold
    9.7.13 ALM-12027 Host PID Usage Exceeds the Threshold
    9.7.14 ALM-12028 Number of Processes in the D State on the Host Exceeds the Threshold
    9.7.15 ALM-12031 User omm or Its Password Is About to Expire
    9.7.16 ALM-12032 User ommdba or Its Password Is About to Expire
    9.7.17 ALM-12033 Slow Disk Fault
    9.7.18 ALM-12034 Periodic Backup Task Failure
    9.7.19 ALM-12035 Data Status Is Unknown After a Restoration Failure
    9.7.20 ALM-12037 NTP Server Is Abnormal
    9.7.21 ALM-12038 Monitoring Metric Dump Failure
    9.7.22 ALM-12039 GaussDB Active and Standby Data Is Not Synchronized
    9.7.23 ALM-12040 Insufficient System Entropy
    9.7.24 ALM-13000 ZooKeeper Service Unavailable
    9.7.25 ALM-13001 Insufficient Available ZooKeeper Connections
    9.7.26 ALM-13002 ZooKeeper Memory Usage Exceeds the Threshold
    9.7.27 ALM-14000 HDFS Service Unavailable
    9.7.28 ALM-14001 HDFS Disk Space Usage Exceeds the Threshold
    9.7.29 ALM-14002 DataNode Disk Space Usage Exceeds the Threshold
    9.7.30 ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold
    9.7.31 ALM-14004 Number of Corrupted HDFS Blocks Exceeds the Threshold
    9.7.32 ALM-14006 Number of HDFS Files Exceeds the Threshold
    9.7.33 ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold
    9.7.34 ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold
    9.7.35 ALM-14009 Number of Faulty DataNodes Exceeds the Threshold
    9.7.36 ALM-14010 NameService Service Is Abnormal
    9.7.37 ALM-14011 HDFS DataNode Data Directory Is Not Configured Properly
    9.7.38 ALM-14012 HDFS JournalNode Data Is Not Synchronized
    9.7.39 ALM-16000 Percentage of Sessions Connected to HiveServer to the Maximum Number Allowed Exceeds the Threshold
    9.7.40 ALM-16001 Hive Data Warehouse Space Usage Exceeds the Threshold
    9.7.41 ALM-16002 Hive SQL Execution Success Rate Is Lower Than the Threshold
    9.7.42 ALM-16004 Hive Service Unavailable
    9.7.43 ALM-18000 Yarn Service Unavailable
    9.7.44 ALM-18002 NodeManager Heartbeat Lost
    9.7.45 ALM-18003 NodeManager Unhealthy
    9.7.46 ALM-18006 MapReduce Task Execution Timeout


  • 9.7.47 ALM-19000 HBase Service Unavailable
    9.7.48 ALM-19006 HBase Disaster Recovery Synchronization Failure
    9.7.49 ALM-25000 LdapServer Service Unavailable
    9.7.50 ALM-25004 LdapServer Data Synchronization Is Abnormal
    9.7.51 ALM-25500 KrbServer Service Unavailable
    9.7.52 ALM-27001 DBService Service Unavailable
    9.7.53 ALM-27003 Heartbeat Between the Active and Standby DBService Nodes Is Interrupted
    9.7.54 ALM-27004 DBService Active and Standby Data Is Not Synchronized
    9.7.55 ALM-28001 Spark Service Unavailable
    9.7.56 ALM-26051 Storm Service Unavailable
    9.7.57 ALM-26052 Number of Available Supervisors of the Storm Service Is Less Than the Threshold
    9.7.58 ALM-26053 Storm Slot Usage Exceeds the Threshold
    9.7.59 ALM-26054 Storm Nimbus Heap Memory Usage Exceeds the Threshold
    9.7.60 ALM-38000 Kafka Service Unavailable
    9.7.61 ALM-38001 Insufficient Kafka Disk Capacity
    9.7.62 ALM-38002 Kafka Heap Memory Usage Exceeds the Threshold
    9.7.63 ALM-24000 Flume Service Unavailable
    9.7.64 ALM-24001 Flume Agent Is Abnormal
    9.7.65 ALM-24003 Flume Client Connection Is Interrupted
    9.7.66 ALM-24004 Flume Data Reading Is Abnormal
    9.7.67 ALM-24005 Flume Data Transmission Is Abnormal
    9.7.68 ALM-12041 Permission of Key Files Is Abnormal
    9.7.69 ALM-12042 Configuration of Key Files Is Abnormal
    9.7.70 ALM-23001 Loader Service Unavailable
    9.7.71 ALM-12357 Failed to Export Audit Logs to OBS
    9.7.72 ALM-12014 Device Partition Lost
    9.7.73 ALM-12015 Device Partition File System Is Read-Only
    9.7.74 ALM-12043 DNS Resolution Duration Exceeds the Threshold
    9.7.75 ALM-12045 Network Read Packet Loss Rate Exceeds the Threshold
    9.7.76 ALM-12046 Network Write Packet Loss Rate Exceeds the Threshold
    9.7.77 ALM-12047 Network Read Packet Error Rate Exceeds the Threshold
    9.7.78 ALM-12048 Network Write Packet Error Rate Exceeds the Threshold
    9.7.79 ALM-12049 Network Read Throughput Rate Exceeds the Threshold
    9.7.80 ALM-12050 Network Write Throughput Rate Exceeds the Threshold
    9.7.81 ALM-12051 Disk Inode Usage Exceeds the Threshold
    9.7.82 ALM-12052 TCP Temporary Port Usage Exceeds the Threshold
    9.7.83 ALM-12053 File Handle Usage Exceeds the Threshold
    9.7.84 ALM-12054 Certificate File Is Invalid
    9.7.85 ALM-12055 Certificate File Is About to Expire
    9.7.86 ALM-18008 Heap Memory Usage of Yarn ResourceManager Exceeds the Threshold
    9.7.87 ALM-18009 Heap Memory Usage of MapReduce JobHistoryServer Exceeds the Threshold
    9.7.88 ALM-20002 Hue Service Unavailable


  • 9.7.89 ALM-43001 Spark Service Unavailable
    9.7.90 ALM-43006 JobHistory Process Heap Memory Usage Exceeds the Threshold
    9.7.91 ALM-43007 JobHistory Process Non-Heap Memory Usage Exceeds the Threshold
    9.7.92 ALM-43008 JobHistory Process Direct Memory Usage Exceeds the Threshold
    9.7.93 ALM-43009 JobHistory GC Time Exceeds the Threshold
    9.7.94 ALM-43010 JDBCServer Process Heap Memory Usage Exceeds the Threshold
    9.7.95 ALM-43011 JDBCServer Process Non-Heap Memory Usage Exceeds the Threshold
    9.7.96 ALM-43012 JDBCServer Process Direct Memory Usage Exceeds the Threshold
    9.7.97 ALM-43013 JDBCServer GC Time Exceeds the Threshold
    9.7.98 ALM-44004 Presto Coordinator Resource Group Queued Tasks Exceed the Threshold
    9.7.99 ALM-44005 Presto Coordinator Process Garbage Collection Time Exceeds the Threshold
    9.7.100 ALM-44006 Presto Worker Process Garbage Collection Time Exceeds the Threshold
    9.7.101 ALM-18010 Number of Pending Yarn Tasks Exceeds the Threshold
    9.7.102 ALM-18011 Memory of Pending Yarn Tasks Exceeds the Threshold
    9.7.103 ALM-18012 Number of Yarn Tasks Terminated in the Previous Period Exceeds the Threshold
    9.7.104 ALM-18013 Number of Yarn Tasks That Failed in the Previous Period Exceeds the Threshold
    9.7.105 ALM-16005 Hive SQL Execution Failures in the Previous Period Exceed the Threshold
    9.8 Object Management
    9.8.1 Introduction to Object Management
    9.8.2 Viewing Configurations
    9.8.3 Managing Service Operations
    9.8.4 Configuring Service Parameters
    9.8.5 Configuring Customized Service Parameters
    9.8.6 Synchronizing Service Configurations
    9.8.7 Managing Role Instance Operations
    9.8.8 Configuring Role Instance Parameters
    9.8.9 Synchronizing Role Instance Configurations
    9.8.10 Decommissioning and Recommissioning Role Instances
    9.8.11 Managing Host Operations
    9.8.12 Isolating a Host
    9.8.13 Canceling Host Isolation
    9.8.14 Starting and Stopping a Cluster
    9.8.15 Synchronizing Cluster Configurations
    9.8.16 Exporting Cluster Configuration Data
    9.9 Log Management
    9.9.1 Viewing and Exporting Audit Logs
    9.9.2 Exporting Service Logs
    9.9.3 Configuring Audit Log Export Parameters
    9.10 Health Check Management
    9.10.1 Performing a Health Check
    9.10.2 Viewing and Exporting Health Check Reports
    9.10.3 Configuring the Number of Health Check Reports to Retain


  • 9.10.4 Managing Health Check Reports
    9.10.5 DBService Health Check Indicators
    9.10.6 Flume Health Check Indicators
    9.10.7 HBase Health Check Indicators
    9.10.8 Host Health Check Indicators
    9.10.9 HDFS Health Check Indicators
    9.10.10 Hive Health Check Indicators
    9.10.11 Kafka Health Check Indicators
    9.10.12 KrbServer Health Check Indicators
    9.10.13 LdapServer Health Check Indicators
    9.10.14 Loader Health Check Indicators
    9.10.15 MapReduce Health Check Indicators
    9.10.16 OMS Health Check Indicators
    9.10.17 Spark Health Check Indicators
    9.10.18 Storm Health Check Indicators
    9.10.19 Yarn Health Check Indicators
    9.10.20 ZooKeeper Health Check Indicators
    9.11 Static Service Pool Management
    9.11.1 Viewing the Status of Static Service Pools
    9.11.2 Configuring Static Service Pools
    9.12 Tenant Management
    9.12.1 Introduction to Tenants
    9.12.2 Adding a Tenant
    9.12.3 Adding a Sub-Tenant
    9.12.4 Deleting a Tenant
    9.12.5 Managing Tenant Directories
    9.12.6 Restoring Tenant Data
    9.12.7 Adding a Resource Pool
    9.12.8 Modifying a Resource Pool
    9.12.9 Deleting a Resource Pool
    9.12.10 Configuring a Queue
    9.12.11 Configuring the Queue Capacity Policy of a Resource Pool
    9.12.12 Clearing Queue Configurations
    9.13 Backup and Restoration
    9.13.1 Introduction to Backup and Restoration
    9.13.2 Backing Up Metadata
    9.13.3 Restoring Metadata
    9.13.4 Modifying a Backup Task
    9.13.5 Viewing Backup and Restoration Tasks
    9.14 Security Management
    9.14.1 Default Users in Clusters with Kerberos Authentication Disabled
    9.14.2 Changing the Password of an OS User
