Natural Language Processing (NLP) is undergoing rapid change, and Hugging Face datasets are at the forefront of it. In this article, we look at what Hugging Face datasets are and why they matter.
We also look at how they can be used to train and evaluate NLP models.
Hugging Face is a company that provides developers with a wide range of datasets.
Whether you are a beginner or an experienced NLP practitioner, the resources on Hugging Face will help you. Join us as we explore the field of NLP and the potential of Hugging Face datasets.
First, What Is NLP?
Natural Language Processing (NLP) is a branch of artificial intelligence. It studies how computers interact with human (natural) languages. NLP involves building models that can understand and interpret human language. As a result, algorithms can perform tasks such as language translation, sentiment analysis, and text generation.
NLP is used in many areas, including customer service, marketing, and healthcare. The goal of NLP is to let computers interpret and understand human language, as it is written or spoken, in a way similar to how people do.
About Hugging Face
Hugging Face is a natural language processing (NLP) and machine learning technology company. It provides many resources that help developers advance the field of NLP. Its best-known product is the Transformers library.
The library is designed for natural language processing applications, and it provides pre-trained models for a variety of NLP tasks such as language translation and question answering.
In addition to the Transformers library, Hugging Face provides a platform for sharing machine learning datasets. This makes it possible to quickly find high-quality datasets for training your own models.
Hugging Face's mission is to make natural language processing (NLP) accessible to developers.
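As a rough illustration of how a shared dataset might be pulled into a project, the sketch below wraps the `load_dataset` function from the `datasets` library. The helper names `load_hub_dataset` and `preview`, the toy rows, and the dataset ID mentioned in the comment are assumptions made for this example. The download step needs the `datasets` package and a network connection, so the demo at the bottom runs on stand-in rows instead.

```python
from itertools import islice


def load_hub_dataset(name, split="train"):
    """Fetch one split of a dataset from the Hugging Face Hub."""
    # Imported lazily so the offline part of this sketch runs without the package.
    from datasets import load_dataset  # assumes `pip install datasets`
    return load_dataset(name, split=split)


def preview(records, n=3):
    """Return the first n rows of any iterable of dict-like records."""
    return [dict(row) for row in islice(records, n)]


# Stand-in rows (hypothetical) so the sketch runs offline. With network access,
# a call such as preview(load_hub_dataset("squad", split="validation")) should
# behave the same way on real data.
toy_rows = [
    {"text": "I loved this film.", "label": 1},
    {"text": "The plot made no sense.", "label": 0},
    {"text": "A solid, if predictable, sequel.", "label": 1},
]

for row in preview(toy_rows, n=2):
    print(row)
```

Keeping `preview` independent of the loader means the same inspection code works on a downloaded split, a streamed one, or a plain list of dictionaries.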
Most Popular Hugging Face Datasets
Cornell Movie-Dialogs Corpus
This is a well-known corpus available through Hugging Face. The Cornell Movie-Dialogs Corpus consists of conversations extracted from movie scripts. NLP models can be trained on this large body of text.
The collection includes 220,579 conversational exchanges between 10,292 pairs of movie characters.
You can use this data for a variety of NLP tasks. For example, you can work on language generation and question answering. You can also build dialogue systems, because the conversations cover a wide range of topics. The dataset has also been used extensively in research.
This makes it a very useful resource for NLP researchers and developers.
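To make the dialogue-system use case a little more concrete, here is a minimal sketch of one common preprocessing step: turning an ordered conversation into (prompt, response) training pairs. The record layout with `speaker` and `text` keys, the helper `to_exchange_pairs`, and the sample lines are all hypothetical; the actual corpus files use their own format.

```python
def to_exchange_pairs(utterances):
    """Pair each utterance with the reply that directly follows it."""
    texts = [u["text"] for u in utterances]
    # zip stops at the shorter sequence, so the last line gets no reply.
    return list(zip(texts, texts[1:]))


# A made-up conversation in a simplified layout (not the corpus's real schema).
toy_conversation = [
    {"speaker": "ALICE", "text": "Are you coming tonight?"},
    {"speaker": "BOB", "text": "I wouldn't miss it."},
    {"speaker": "ALICE", "text": "Good. Bring the tickets."},
]

for prompt, response in to_exchange_pairs(toy_conversation):
    print(f"{prompt!r} -> {response!r}")
```

A conversation of n lines yields n - 1 pairs, and each pair can serve as one input/target example for a response-generation model.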
OpenWebText Corpus
The OpenWebText Corpus is a collection of web pages that you can find on the Hugging Face platform. It contains a large number of web pages, such as articles, blogs, and forums. All of them were selected for their high quality.
Data is essential for training and evaluating NLP models. You can use this dataset for tasks such as translation and summarization. You can also use it for sentiment analysis, which is important for many applications.
OpenWebText was assembled as an open reproduction of the WebText corpus used to train GPT-2, with the aim of providing a high-quality training sample. It is a large dataset, with roughly 38GB of text drawn from about 8 million documents.
BERT
BERT (Bidirectional Encoder Representations from Transformers) is an NLP model. It is pre-trained and available on the Hugging Face platform. BERT was developed by the Google AI Language team. It is trained on a large corpus of text to understand the meaning of words in context.
Because BERT is a transformer-based model, it can process an entire input sequence at once rather than one word at a time. It uses the transformer's attention mechanism to interpret the input sequence.
This is what enables BERT to grasp the meaning of a word from the words around it.
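The idea of looking at every position at once can be sketched with a stripped-down version of scaled dot-product self-attention. This is an illustration under simplifying assumptions, not BERT's actual implementation: a real transformer applies learned query, key, and value projections and uses several attention heads, whereas here the raw token vectors play all three roles. The function names and the toy vectors are made up for the example.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Score this token against every token in the sequence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted mix of all token vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, vectors)) for j in range(dim)]
        )
    return outputs, weights


# Three toy 2-dimensional "token" vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` covers the whole sequence, to the left and to the right of the token in question, which is the sense in which the model reads the input bidirectionally.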
You can use BERT for text classification, language understanding, named entity recognition, and coreference resolution, among other NLP applications. It is also useful for text generation and machine reading comprehension.
SQuAD
SQuAD (Stanford Question Answering Dataset) is a collection of questions and answers. You can use it to train machine reading comprehension models. The dataset contains more than 100,000 question-answer pairs on a variety of topics. SQuAD differs from earlier datasets in that it focuses on questions that require understanding the passage rather than simply matching keywords.
As a result, it is an excellent resource for building and testing question-answering models and other machine comprehension tasks. The questions in SQuAD are also written by people, which ensures a high level of quality and consistency.
Overall, SQuAD is an essential resource for NLP researchers and developers.
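For readers who have not seen a SQuAD-style record, the sketch below shows one in the layout we understand the Hugging Face version to use (`context`, `question`, and an `answers` dictionary with `text` and `answer_start` lists). The passage itself is invented, and both helper functions are hypothetical. The `exact_match` check is a simplified stand-in: the official metric also strips punctuation and articles before comparing.

```python
def answer_spans_are_consistent(example):
    """Check that each answer text really sits at its stated character offset."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


def exact_match(prediction, example):
    """Simplified exact match: compare lowercased, whitespace-normalized strings."""
    def normalize(s):
        return " ".join(s.lower().split())
    return any(
        normalize(prediction) == normalize(text)
        for text in example["answers"]["text"]
    )


# An invented record in SQuAD-style layout (not taken from the dataset).
toy_example = {
    "context": "The Eiffel Tower was completed in 1889 and stands in Paris.",
    "question": "When was the Eiffel Tower completed?",
    "answers": {"text": ["1889"], "answer_start": [34]},
}

print(answer_spans_are_consistent(toy_example))
print(exact_match(" 1889 ", toy_example))
print(exact_match("Paris", toy_example))
```

Because the answer is a span of the passage, a model has to locate it in context; matching a keyword from the question is not enough.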
MNLI
MNLI, or Multi-Genre Natural Language Inference, is a dataset used to train and test machine learning models on natural language inference. The goal in MNLI is to determine whether a given hypothesis is true (entailment), false (contradiction), or neutral with respect to another statement, the premise.
MNLI differs from earlier datasets because it contains text from many genres. These genres range from fiction to news to government documents. Because of this variety, MNLI is a more representative sample of real-world text, and likely a better one than many other natural language inference datasets.
With more than 400,000 examples, MNLI provides plenty of material for training models. It also includes annotations for each example to support model training.
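The three-way labeling can be sketched as follows. The premise/hypothesis pairs are invented, the `accuracy` helper is hypothetical, and the integer-to-name mapping is our assumption about the convention used by the GLUE version of MNLI, so check the dataset card before relying on it.

```python
# Assumed label convention (verify against the dataset card).
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def accuracy(examples, predictions):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(
        1 for example, predicted in zip(examples, predictions)
        if example["label"] == predicted
    )
    return correct / len(examples)


# Invented examples: one premise paired with three different hypotheses.
toy_examples = [
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A man is performing music.", "label": 0},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The man is a famous musician.", "label": 1},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The stage is empty.", "label": 2},
]

# Pretend model output: the third prediction is wrong on purpose.
toy_predictions = [0, 1, 1]

for example, predicted in zip(toy_examples, toy_predictions):
    print(example["hypothesis"], "->", LABEL_NAMES[predicted])
print("accuracy:", accuracy(toy_examples, toy_predictions))
```

Accuracy is the usual headline metric for this task, since every example has exactly one of the three labels.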
Final Thoughts
In conclusion, Hugging Face datasets are a valuable resource for NLP researchers and developers. Hugging Face provides a foundation for NLP development through its wide range of datasets.
We think the standout dataset on Hugging Face is the OpenWebText Corpus.
This high-quality collection holds roughly 38GB of text. It is a valuable resource for training and evaluating NLP models. You can try OpenWebText and the other datasets in your next projects.