Hive yog ib qho cuab yeej siv loj Cov Ntaub Ntawv Analytics hauv kev lag luam, thiab nws yog qhov chaw zoo pib yog tias koj tshiab rau Cov Ntaub Ntawv Loj. Apache Hive zaj lus qhia no dhau los ntawm cov ntsiab lus ntawm Apache Hive, vim li cas lub Hive yog qhov tsim nyog, nws cov yam ntxwv, thiab txhua yam koj yuav tsum paub.
Cia peb pib nkag siab txog Hadoop lub moj khaum uas Apache Hive tau tsim rau.
Apache Hadoop
Apache Hadoop yog dawb thiab Qhib qhov chaw platform rau khaws cia thiab ua cov ntaub ntawv loj xws li qhov loj me ntawm gigabytes rau petabytes. Hadoop tso cai rau pawg ntau lub khoos phis tawj los txheeb xyuas cov ntaub ntawv loj heev nyob rau hauv tib lub sijhawm, tsis yog xav tau ib lub khoos phis tawj loj los khaws thiab txheeb xyuas cov ntaub ntawv.
MapReduce thiab Hadoop Distributed File System yog ob qho tib si:
- MapQhia - MapReduce yog cov txheej txheem sib txuas ua ke rau kev tuav cov ntim loj ntawm cov txheej txheem, ib nrab txheej txheem, thiab cov ntaub ntawv tsis tsim nyog ntawm cov khoom siv kho vajtse.
- HDFS - HDFS (Hadoop Distributed File System) yog Hadoop lub moj khaum uas khaws thiab ua cov ntaub ntawv. Nws yog ib qho txhaum-tolerant file system uas khiav ntawm tus qauv kho vajtse
Cov phiaj xwm sib txawv (cov cuab yeej) hauv Hadoop ecosystem, suav nrog Sqoop, Pig, thiab Hive, yog siv los pab Hadoop modules.
- Tshees - Hive yog lub hauv paus rau kev sau SQL-style scripts uas ua MapReduce suav.
- npua - Npua yog cov lus txheej txheem txheej txheem uas yuav raug siv los tsim ib tsab ntawv rau MapReduce txheej txheem.
- Sqoop - Sqoop yog lub cuab yeej rau kev xa tawm thiab xa tawm cov ntaub ntawv ntawm HDFS thiab RDBMS.
Yuav ua li cas yog Apache Hive?
Apache Hive yog qhov chaw qhib cov ntaub ntawv warehouse program rau kev nyeem ntawv, sau ntawv, thiab tswj cov ntaub ntawv loj loj khaws cia ncaj qha rau hauv Apache Hadoop Distributed File System (HDFS) lossis lwm cov ntaub ntawv khaws cia xws li Apache HBase.
Cov neeg tsim tawm SQL tuaj yeem siv Hive los tsim Hive Query Language (HQL) cov lus rau cov ntaub ntawv nug thiab kev tshuaj xyuas uas piv rau cov nqe lus SQL li niaj zaus. Nws tau tsim los ua kom MapReduce programming yooj yim dua los ntawm kev tshem tawm qhov xav tau kawm thiab sau Java code ntev. Hloov chaw, koj tuaj yeem sau koj cov lus nug hauv HQL, thiab Hive yuav tsim daim ntawv qhia thiab txo cov haujlwm rau koj.
Lub SQL-zoo li interface ntawm Apache Hive tau dhau los ua Tus Qauv Kub rau kev ua haujlwm ad-hoc tshawb nrhiav, xaus lus, thiab txheeb xyuas cov ntaub ntawv Hadoop. Thaum suav nrog huab suav networks, qhov kev daws teeb meem no tshwj xeeb tshaj yog tus nqi tsim nyog thiab muaj peev xwm ua tau, uas yog vim li cas ntau lub tuam txhab, suav nrog Netflix thiab Amazon, txuas ntxiv txhim kho thiab txhim kho Apache Hive.
Keeb kwm
Thaum lawv nyob hauv Facebook, Joydeep Sen Sarma thiab Ashish Thusoo tau tsim Apache Hive. Lawv ob leeg lees paub tias kom tau txais txiaj ntsig zoo tshaj plaws ntawm Hadoop, lawv yuav tsum tau tsim qee qhov nyuaj Java Map-Txo cov haujlwm. Lawv tau lees paub tias lawv yuav tsis muaj peev xwm qhia lawv txoj kev nthuav dav dav dav thiab pab pawg tshuaj ntsuam xyuas ntawm cov txuj ci uas lawv xav tau los txhawb Hadoop thoob plaws lub tuam txhab. Cov kws tshaj lij thiab cov kws tshuaj ntsuam feem ntau siv SQL raws li tus neeg siv interface.
Thaum SQL tuaj yeem ua tau raws li feem ntau ntawm cov kev xav tau kev xav tau, cov neeg tsim khoom kuj tseem npaj los koom nrog Hadoop's programmability. Apache Hive tau tshwm sim los ntawm ob lub hom phiaj no: SQL-raws li cov lus tshaj tawm uas tseem ua rau cov neeg tsim khoom nqa lawv tus kheej cov ntawv sau thiab cov kev pab cuam thaum SQL tsis txaus.
Nws kuj tau tsim los tuav cov metadata hauv nruab nrab (Hadoop-raws li) txog tag nrho cov datasets hauv lub tuam txhab kom tsim kho cov ntaub ntawv-tsav cov koom haum yooj yim dua.
Apache Hive ua haujlwm li cas?
Hauv cov ntsiab lus, Apache Hive hloov ib qho kev pab cuam sau ntawv hauv HiveQL (SQL-zoo li) hom lus rau hauv ib lossis ntau dua Java MapReduce, Tez, lossis Spark cov haujlwm. (Tag nrho cov tshuab ua tiav no tau sib xws nrog Hadoop YARN.) Tom qab ntawd, Apache Hive npaj cov ntaub ntawv rau hauv cov ntxhuav rau Hadoop Distributed File System HDFS) thiab ua cov haujlwm ntawm pawg kom tau txais cov lus teb.
Cov ntaub ntawv
Lub Apache Hive cov ntxhuav tau teeb tsa tib yam li cov ntxhuav hauv cov ntaub ntawv sib raug zoo raug teeb tsa, nrog cov ntaub ntawv sib txawv ntawm qhov loj los ntawm qhov loj dua mus rau qhov me. Databases yog tsim los ntawm cov rooj uas muab faib ua pawg, uas tau muab faib ntxiv rau hauv thoob. HiveQL (Hive Query Language) yog siv los nkag rau cov ntaub ntawv, uas tuaj yeem hloov kho lossis ntxiv ntxiv. Cov ntaub ntawv rooj yog serialized nyob rau hauv txhua lub database, thiab txhua lub rooj muaj nws tus kheej HDFS directory.
Architecture
Tam sim no peb yuav tham txog qhov tseem ceeb tshaj plaws ntawm Hive Architecture. Cov khoom ntawm Apache Hive yog raws li nram no:
Metastore - Nws khaws cov ntaub ntawv hais txog txhua lub rooj, xws li nws cov qauv thiab qhov chaw. Qhov muab faib metadata kuj tseem suav nrog hauv Hive. Qhov no tso cai rau tus neeg tsav tsheb kom taug qab qhov kev nce qib ntawm cov ntaub ntawv sib txawv sib kis thoob plaws hauv pawg. Cov ntaub ntawv tau muab khaws cia rau hauv ib hom ntawv RDBMS. Hive metadata yog qhov tseem ceeb heev rau tus tsav tsheb kom tswj tau cov ntaub ntawv. Tus neeg rau zaub mov thaub qab duplicates cov ntaub ntawv tsis tu ncua kom nws tuaj yeem rov qab tau thaum cov ntaub ntawv poob.
Tsav - HiveQL cov lus tau txais los ntawm tus tsav tsheb, uas ua haujlwm ua tus tswj xyuas. Los ntawm kev tsim cov kev sib tham, tus neeg tsav tsheb pib ua tiav cov lus qhia. Nws ua raws li tus thawj coj lub neej thiab kev vam meej. Thaum lub sijhawm ua tiav HiveQL nqe lus, tus tsav tsheb khaws cov ntaub ntawv xav tau. Nws kuj ua haujlwm raws li cov ntaub ntawv lossis cov lus nug tau txais cov ntsiab lus tom qab Txo cov txheej txheem.
Sau - Nws ua tiav HiveQL query compilation. Cov lus nug tam sim no tau hloov mus rau qhov kev npaj ua tiav. Cov dej num tau teev tseg hauv txoj kev npaj. Nws kuj suav nrog cov kauj ruam uas MapReduce yuav tsum tau ua kom tau txais cov txiaj ntsig raws li tau txhais los ntawm cov lus nug. Cov lus nug tau hloov pauv mus rau Kev Tshawb Fawb Txog Kev Tshawb Fawb los ntawm Hive's compiler (AST). Hloov cov AST mus rau Directed Acyclic Graph tom qab kuaj xyuas kev sib raug zoo thiab suav nrog lub sijhawm tsis raug (DAG).
Qhov zoo - Nws optimizes DAG los ntawm kev ua cov kev hloov pauv sib txawv ntawm txoj kev npaj ua tiav. Nws muab cov kev hloov pauv rau kev txhim kho kev ua tau zoo, xws li tig lub raj xa dej ntawm kev koom ua ke rau hauv ib qho kev koom ua ke. Txhawm rau txhim kho kev nrawm, tus optimizer yuav faib cov dej num, xws li kev hloov pauv rau cov ntaub ntawv ua ntej ua haujlwm txo qis.
Neeg tua neeg - Tus executor khiav cov dej num thaum muab tso ua ke thiab optimization tiav. Cov haujlwm yog pipelined los ntawm Executor.
CLI, UI, thiab Thrift Server - Cov kab lus hais kom ua (CLI) yog tus neeg siv interface uas tso cai rau tus neeg siv sab nraud sib txuas lus nrog Hive. Hive's thrift server, zoo ib yam li JDBC lossis ODBC raws tu qauv, tso cai rau cov neeg siv sab nraud sib txuas lus nrog Hive ntawm lub network.
Security
Apache Hive tau koom ua ke nrog Hadoop kev ruaj ntseg, uas siv Kerberos rau cov neeg siv khoom sib koom ua pov thawj. HDFS dictates kev tso cai rau cov ntaub ntawv tsim tawm tshiab hauv Apache Hive, tso cai rau koj pom zoo los ntawm tus neeg siv, pab pawg, thiab lwm tus.
Ntsiab nta
- Hive txhawb cov rooj sab nraud, uas cia koj ua cov ntaub ntawv yam tsis tau khaws cia hauv HDFS.
- Nws kuj ua rau cov ntaub ntawv segmentation ntawm lub rooj theem kom ceev.
- Apache Hive ua tau zoo raws li Hadoop qhov kev xav tau qis-qib interface.
- Hive ua rau cov ntaub ntawv summarization, querying, thiab tsom xam yooj yim dua.
- HiveQL tsis tas yuav muaj kev txawj programming; kev nkag siab yooj yim ntawm SQL queries yog txaus.
- Peb kuj tuaj yeem siv Hive los ua cov lus nug ad-hoc rau kev txheeb xyuas cov ntaub ntawv.
- Nws yog scalable, paub, thiab adaptable.
- HiveQL tsis tas yuav muaj kev txawj programming; kev nkag siab yooj yim ntawm SQL queries yog txaus.
cov kev pab cuam
Apache Hive tso cai rau cov ntawv tshaj tawm hnub kawg, kev ntsuas kev lag luam txhua hnub, kev tshawb nrhiav ad-hoc, thiab kev txheeb xyuas cov ntaub ntawv. Cov kev nkag siab zoo muab los ntawm Apache Hive muab cov txiaj ntsig zoo sib tw thiab ua kom yooj yim rau koj los teb rau cov kev xav tau ntawm kev lag luam.
Nov yog qee qhov txiaj ntsig ntawm kev muaj cov ntaub ntawv zoo li no tau yooj yim:
- Yooj yim uas siv - Nrog nws cov lus zoo li SQL, querying cov ntaub ntawv yog yooj yim to taub.
- Accelerated cov ntaub ntawv nkag - Vim Apache Hive nyeem cov schema yam tsis tau txheeb xyuas hom lus lossis lub ntsiab lus txhais, cov ntaub ntawv tsis tas yuav tsum tau nyeem, txheeb xyuas, thiab txheeb xyuas rau disc hauv cov ntaub ntawv sab hauv. Nyob rau hauv sib piv, nyob rau hauv ib tug pa database, cov ntaub ntawv yuav tsum tau validated txhua zaus nws ntxiv.
- Superior scalability, yooj, thiab nqi-zoo - Vim tias cov ntaub ntawv khaws cia hauv HDFS, Apache Hive tuaj yeem tuav 100s ntawm petabytes ntawm cov ntaub ntawv, ua rau nws muaj kev xaiv ntau dua li cov ntaub ntawv raug. Apache Hive, raws li huab-raws li kev pabcuam Hadoop, tso cai rau cov neeg siv khoom nrawm nrawm nrawm thiab nqis virtual servers kom tau raws li kev hloov pauv haujlwm.
- Muaj peev xwm ua haujlwm ntau - Cov ntaub ntawv loj tuaj yeem daws txog 100,000 cov lus nug hauv ib teev.
Cov kev txwv
- Feem ntau, Apache Hive cov lus nug muaj latency siab heev.
- Subquery kev txhawb nqa yog txwv.
- Cov lus nug ntawm lub sijhawm tiag tiag thiab kev hloov pauv qib tsis muaj nyob hauv Apache Hive.
- Tsis muaj kev txhawb nqa rau materialized views.
- Hauv lub Hive, hloov tshiab thiab tshem tawm cov yeeb yam tsis txaus siab.
- Tsis yog npaj rau OLTP (online transitional process).
Pib nrog Apache Hive
Apache Hive yog tus khub Hadoop muaj zog uas ua kom yooj yim thiab ua kom koj cov haujlwm ua haujlwm. Txhawm rau kom tau txais txiaj ntsig zoo tshaj plaws ntawm Apache Hive, kev sib koom ua ke tsis sib haum yog qhov tseem ceeb. Thawj kauj ruam yog mus rau lub website.
1. Kev teeb tsa Hive los ntawm kev tso tawm ruaj khov
Pib los ntawm rub tawm qhov kev tso tawm ruaj khov tshiab tshaj plaws ntawm Hive los ntawm ib qho ntawm Apache rub daim iav (saib Hive tso). Tom qab ntawd lub tarball yuav tsum tau muab tshem tawm. Qhov no yuav tsim ib lub subfolder hu ua hive-xyz (qhov twg xyz yog tus naj npawb tso tawm):
Teem ib puag ncig hloov pauv HIVE_HOME kom taw tes rau cov ntawv qhia kev teeb tsa:
Thaum kawg, ntxiv $HIVE_HOME/bin rau koj PATH
:
2. Khiav Hive
Hive siv Hadoop, yog li:
- koj yuav tsum muaj Hadoop hauv koj txoj kev LOS YOG
3. Kev ua haujlwm DLL
Tsim Hive Rooj
tsim ib lub rooj hu ua pokes nrog ob kab, thawj ntawm uas yog tus lej thiab qhov thib ob yog ib txoj hlua.
Tshawb nrhiav los ntawm Tables
Sau Tag Nrho Cov Rooj
Hloov thiab tso cov ntxhuav
Cov npe ntawm lub rooj tuaj yeem hloov pauv thiab cov kab tuaj yeem ntxiv lossis hloov pauv:
Nws tsim nyog sau cia tias REPLACE COLUMNS hloov tag nrho cov kab uas twb muaj lawm thaum tsuas yog hloov lub rooj cov qauv thiab tsis yog cov ntaub ntawv. Ib haiv neeg SerDe yuav tsum tau siv rau hauv lub rooj. REPLACE COLUMNS kuj tseem siv tau los tshem tawm txhua kab los ntawm schema ntawm lub rooj:
Muab Cov Rooj
Muaj ntau yam haujlwm ntxiv thiab nta hauv Apache Hive uas koj tuaj yeem kawm txog los ntawm kev mus saib lub vev xaib raug cai.
xaus
Hive txhais yog cov ntaub ntawv kev pabcuam cuam tshuam rau kev nug thiab kev tshuaj xyuas rau cov ntaub ntawv loj loj uas tau tsim rau saum Apache Hadoop. Cov kws tshaj lij xaiv nws dua lwm cov kev pab cuam, cov cuab yeej, thiab software vim nws yog tsim los rau Hive uas nws kim heev cov ntaub ntawv thiab yooj yim rau siv.
Vam tias qhov kev qhia no pab koj pib nrog Apache Hive thiab ua rau koj cov haujlwm ua haujlwm tau zoo dua. Qhia rau peb paub hauv cov lus.
Sau ntawv cia Ncua