Natural Language Processing (NLP) is going through a new wave of development, and Hugging Face datasets are at the forefront of it. In this article, we look at why Hugging Face datasets matter.
We also show how they can be used to train and evaluate NLP models.
Hugging Face is a company that gives developers access to a wide variety of datasets.
Whether you are a beginner or an experienced NLP practitioner, the datasets available on Hugging Face are likely to be useful to you. Join us as we explore NLP and learn what Hugging Face datasets can do.
First, what is NLP?
Natural Language Processing (NLP) is a branch of artificial intelligence that studies how computers interact with human (natural) language. It involves building models that can understand and interpret human language, so that algorithms can carry out tasks such as language translation, sentiment analysis, and text generation.
NLP is used in many fields, including customer service, marketing, and healthcare. Its goal is to let computers interpret and understand human language, whether written or spoken, in a way that comes close to how people do.
Overview of Hugging Face
Hugging Face is a natural language processing (NLP) and machine learning technology company. It offers a range of resources that help developers advance the field of NLP. Its best-known product is the Transformers library.
The library is designed for natural language processing applications, and it provides pre-trained models for a variety of NLP tasks such as language translation and question answering.
In addition to the Transformers library, Hugging Face offers a platform for sharing machine learning datasets. This lets developers quickly get hold of data for training their models.
Hugging Face's mission is to make natural language processing (NLP) accessible to developers.
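As a rough sketch of how a dataset shared on the platform might be pulled into a project, the snippet below wraps the `load_dataset` function from the `datasets` library. The helper names `load_hub_split` and `preview` are hypothetical, the `datasets` package has to be installed separately, and the dataset identifier is whatever name the platform lists. The demo at the bottom uses local stand-in records so that it runs offline.

```python
from itertools import islice


def load_hub_split(dataset_id, split="train"):
    """Download one split of a hosted dataset.

    Assumes `pip install datasets` and network access. The import is done
    lazily so the rest of this sketch runs without the package.
    """
    from datasets import load_dataset
    return load_dataset(dataset_id, split=split)


def preview(rows, n=3):
    """Return the first n records of any iterable of records."""
    return list(islice(rows, n))


if __name__ == "__main__":
    # Invented stand-in records, not real dataset content.
    sample = [{"text": "hello"}, {"text": "world"}, {"text": "again"}]
    print(preview(sample, 2))
```

With the package installed, a call such as `preview(load_hub_split("squad", split="validation"))` would be one way to inspect a few records, assuming the dataset is still listed under that identifier.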
Popular Hugging Face Datasets
Cornell Movie-Dialogs Corpus
This is a well-known dataset available on Hugging Face. The Cornell Movie-Dialogs Corpus contains conversations taken from movie screenplays, and NLP models can be trained on this large collection of dialogue text.
The collection includes 220,579 conversational exchanges between 10,292 pairs of movie characters.
You can use this dataset for a range of NLP tasks, such as language generation and question answering. Because the dialogues cover a wide variety of topics, it is also a good fit for building conversational systems. The dataset has been widely used in research projects as well.
This makes it a valuable tool for NLP researchers and developers.
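One common way to prepare dialogue data for a conversational model is to turn each conversation into (prompt, response) pairs. The sketch below assumes a conversation has already been parsed into an ordered list of utterance strings. It does not reproduce the corpus's actual file layout, the function name `utterance_pairs` is hypothetical, and the sample lines are invented.

```python
def utterance_pairs(conversation):
    """Pair each utterance with the one that follows it.

    `conversation` is an ordered list of utterance strings. A conversation
    with fewer than two utterances yields no pairs.
    """
    return list(zip(conversation, conversation[1:]))


if __name__ == "__main__":
    # Invented dialogue, for illustration only.
    dialogue = ["Are you coming tonight?", "I'm not sure yet.", "Let me know by six."]
    for prompt, response in utterance_pairs(dialogue):
        print(f"{prompt!r} -> {response!r}")
```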
OpenWebText Corpus
The OpenWebText Corpus is a collection of web pages that you can find on the Hugging Face platform. It covers many kinds of online content, such as articles, blogs, and forum posts, and the pages were selected for their quality.
The dataset is particularly valuable for training and evaluating NLP models. Because it is raw, unlabeled text, it is mainly used to pre-train language models that are then applied to tasks such as translation, summarization, and sentiment analysis, which makes it a major asset for many applications.
OpenWebText was assembled as an open recreation of OpenAI's WebText corpus and is distributed through the Hugging Face platform in a form that is convenient for training. It is a large dataset, with roughly 40GB of text drawn from about 8 million documents.
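A corpus of this size is usually processed lazily rather than loaded into memory all at once. The sketch below shows one way to group any stream of documents into fixed-size batches using only the standard library. The function name `iter_batches` is hypothetical, and the documents in the demo are invented placeholders. In practice the stream could be the iterable returned by the `datasets` library's `load_dataset(..., streaming=True)`, but that call is left out here so the example runs offline.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield lists of up to batch_size records without materializing the stream."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Placeholder documents standing in for a much larger stream.
    documents = (f"document {i}" for i in range(5))
    for batch in iter_batches(documents, 2):
        print(batch)
```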
BERT
BERT (Bidirectional Encoder Representations from Transformers) is an NLP model rather than a dataset. It is pre-trained and available on the Hugging Face platform. BERT was developed by the Google AI Language team and was trained on a very large text corpus so that it can understand the meaning of words in the context of a sentence.
Because BERT is a transformer-based model, it can process the entire input sequence at once instead of one word at a time. Transformer-based models use attention mechanisms to interpret sequential input.
This is what allows BERT to take the surrounding context into account when interpreting each word in a sentence.
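To make the attention idea more concrete, here is a toy version of scaled dot-product attention, the formulation used in the original Transformer architecture. It is a pure-Python illustration on tiny hand-made vectors, not BERT's actual implementation, which adds learned projections, multiple heads, and many stacked layers. The function names and input values are chosen only for this example.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Every query is scored against every key at once, and each output is the
    weighted average of the value vectors.
    """
    dim = len(keys[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


if __name__ == "__main__":
    # Three toy token vectors attending to one another (self-attention).
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in attention(tokens, tokens, tokens):
        print([round(x, 3) for x in row])
```

Because every token's output mixes information from all positions in the sequence, the representation of a word depends on the words around it.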
You can use BERT for text classification, language understanding, named entity recognition, and coreference resolution, among other NLP applications. It is also useful for machine reading comprehension, although as an encoder-only model it is less commonly used for text generation.
SQuAD
SQuAD (Stanford Question Answering Dataset) is a dataset of questions and answers that you can use to train machine reading comprehension models. It contains more than 100,000 question-answer pairs on a wide range of topics. SQuAD differs from earlier datasets in one important respect.
It focuses on questions that require understanding the context of a passage rather than simple keyword matching.
As a result, it is an excellent resource for building and testing models for question answering and other machine comprehension tasks. The questions in SQuAD are also written by people, which gives the dataset a high level of quality and consistency.
Overall, SQuAD is a valuable resource for NLP researchers and developers.
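The sketch below illustrates what a SQuAD-style record looks like and how a predicted answer might be scored. It assumes the field layout used by the copy of SQuAD on the Hugging Face platform (`context`, `question`, and `answers` with parallel `text` and `answer_start` lists). The record itself is invented, and `exact_match` is a simplified stand-in for the official evaluation script rather than a copy of it.

```python
import re
import string


def answer_span_is_valid(record):
    """Check that each answer text appears in the context at its stated offset."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """True if the prediction equals any reference answer after normalization."""
    return any(normalize(prediction) == normalize(ref) for ref in references)


if __name__ == "__main__":
    # Invented record that follows the assumed field layout.
    context = "The Eiffel Tower is located in Paris."
    record = {
        "context": context,
        "question": "Where is the Eiffel Tower located?",
        "answers": {"text": ["Paris"], "answer_start": [context.index("Paris")]},
    }
    print(answer_span_is_valid(record))
    print(exact_match("paris.", record["answers"]["text"]))
```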
MNLI
MNLI, or Multi-Genre Natural Language Inference, is a dataset used to train and test machine learning models for natural language inference. The task is to decide whether one statement (the hypothesis) is true, false, or undetermined given another statement (the premise), which the dataset labels as entailment, contradiction, or neutral.
MNLI differs from earlier datasets in that it includes text from many genres, ranging from fiction to magazine articles and government reports. Because of this variety, MNLI is more representative of real-world text, and it is widely regarded as a better benchmark than many other natural language inference datasets.
With more than 400,000 examples, MNLI provides a substantial amount of data for training models. Each example also comes with a label, which gives models the supervision they need to learn.
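As a small illustration of how labeled inference examples are used for evaluation, the sketch below maps integer labels to class names and computes accuracy. The mapping (0 for entailment, 1 for neutral, 2 for contradiction) is an assumption based on the copy of MNLI distributed with the GLUE benchmark on the Hugging Face platform and should be checked against the dataset you actually load. The premise and hypothesis pairs are invented.

```python
# Assumed label mapping; verify it against the dataset's own metadata.
LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


if __name__ == "__main__":
    # Invented examples in the premise / hypothesis / label layout.
    examples = [
        {"premise": "A man is playing a guitar.", "hypothesis": "A person is making music.", "label": 0},
        {"premise": "A man is playing a guitar.", "hypothesis": "The man is a professional.", "label": 1},
        {"premise": "A man is playing a guitar.", "hypothesis": "Nobody is playing an instrument.", "label": 2},
    ]
    predicted = [0, 1, 1]  # hypothetical model output
    for example, guess in zip(examples, predicted):
        print(LABELS[example["label"]], "vs predicted", LABELS[guess])
    print(accuracy(predicted, [example["label"] for example in examples]))
```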
Conclusion
In conclusion, Hugging Face datasets are a valuable resource for NLP researchers and developers. By offering a diverse collection of datasets, Hugging Face provides a foundation for NLP development.
In our view, the standout Hugging Face dataset is the OpenWebText Corpus.
This high-quality dataset contains roughly 40GB of text and is a valuable resource for training and evaluating NLP models. Try OpenWebText and the other datasets in your next project.