Natural Language Processing (NLP) is going through a new wave of change, and Hugging Face datasets are at the forefront of it. In this section, we look at why Hugging Face datasets matter.
We also look at how they can be used to train and evaluate NLP models.
Hugging Face is a company that provides developers with a wide variety of datasets.
Whether you are a beginner or an experienced NLP practitioner, the data available on Hugging Face will be useful to you. Join us as we explore the field of NLP and learn what Hugging Face datasets make possible.
First, What Is NLP?
Natural Language Processing (NLP) is a branch of artificial intelligence. It studies how computers interact with human (natural) languages. NLP involves building models that can understand and interpret human language, so that algorithms can perform tasks such as machine translation, sentiment analysis, and text generation.
NLP is used in many areas, including customer service, marketing, and healthcare. Its goal is to let computers interpret and understand human language, whether written or spoken, in a way that is close to how people do.
Overview of Hugging Face
Hugging Face is a natural language processing (NLP) and machine learning technology company. It provides a wide range of resources to help developers advance the field of NLP. Its best-known product is the Transformers library.
The library is designed for natural language processing applications. It also provides pre-trained models for a variety of NLP tasks, such as translation and question answering.
In addition to the Transformers library, Hugging Face offers a platform for sharing machine learning datasets. This lets developers quickly find high-quality datasets for training their models.
Hugging Face's goal is to make natural language processing (NLP) accessible to developers.
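To make the dataset-sharing platform concrete, here is a minimal sketch of how a dataset might be pulled from the Hub with the `datasets` library. The helper name `load_split` is hypothetical, and the dataset ID `"squad"` in the usage comment is only an assumption about what is currently published; the call needs network access and `pip install datasets`.

```python
def load_split(dataset_id: str, split: str = "train"):
    """Download (or reuse the cached copy of) one split of a Hub dataset.

    `load_split` is a hypothetical convenience wrapper, not part of the library.
    """
    # Imported lazily so this file can be read without the library installed.
    from datasets import load_dataset

    return load_dataset(dataset_id, split=split)


# Example usage (requires network access; the dataset ID is an assumption):
#   train = load_split("squad")
#   print(train[0]["question"])
```

Wrapping the call in a small function keeps the dataset ID and split in one place, which makes it easier to swap in another dataset later.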
Most Popular Hugging Face Datasets
Cornell Movie-Dialogs Corpus
This is a well-known dataset available on Hugging Face. The Cornell Movie-Dialogs Corpus consists of conversations extracted from movie screenplays. NLP models can be trained on this large collection of dialogue data.
The collection contains 220,579 conversational exchanges between 10,292 pairs of movie characters.
You can use this dataset for a variety of NLP tasks. For example, you can build language generation and question answering projects. You can also build dialogue systems, because the conversations cover a wide range of topics. The dataset has also been used widely in research projects.
As a result, it is a very useful resource for NLP researchers and developers.
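One common way to prepare dialogue data for a dialogue system is to turn each conversation into (prompt, response) pairs. The sketch below assumes a conversation has already been loaded as an ordered list of utterances; the sample lines and the helper name `adjacent_pairs` are made up for illustration and are not taken from the corpus.

```python
def adjacent_pairs(conversation):
    """Pair every utterance with the utterance that follows it."""
    return list(zip(conversation, conversation[1:]))


# Hypothetical conversation, used only to illustrate the shape of the data.
conversation = [
    "Where are you headed?",
    "Downtown. Want a ride?",
    "Sure, thanks.",
]

pairs = adjacent_pairs(conversation)
for prompt, response in pairs:
    print(f"{prompt!r} -> {response!r}")
```

A conversation with n utterances yields n - 1 training pairs, so a single-line conversation produces none.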
OpenWebText Corpus
The OpenWebText Corpus is a collection of web pages that you can access on the Hugging Face platform. It includes many kinds of web pages, such as articles, blogs, and forums, all selected for their high quality.
The dataset is especially valuable for training and evaluating NLP models. You can use it for tasks such as translation and summarization. You can also use it for sentiment analysis, which is a key part of many applications.
The OpenWebText Corpus was curated to provide a high-quality training sample. It is a large dataset, containing roughly 40GB of text.
BERT
BERT (Bidirectional Encoder Representations from Transformers) is an NLP model. It is pre-trained and available on the Hugging Face platform. BERT was developed by the Google AI Language team. It is trained on a large text corpus so that it captures the context of words in a sentence.
Because BERT is a transformer-based model, it can process the entire input sequence at once rather than one word at a time. A transformer-based model uses attention mechanisms to interpret sequences.
This is what enables BERT to capture the context of words in a sentence.
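The attention idea can be sketched in a few lines of plain Python. This is a simplified, single-head version of scaled dot-product self-attention in which the queries, keys, and values are all the raw token vectors. It is not BERT's actual implementation, which adds learned projections, multiple heads, and many stacked layers. The toy vectors are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score this token against every token in the sequence at once.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted mix of all token vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


# Three toy "token" vectors (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
```

Every output vector depends on all input vectors, which is how the whole sequence is processed together rather than one word at a time.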
You can use BERT for text classification, language understanding, named entity recognition, and coreference resolution, among other NLP applications. It is also useful for text generation and machine reading comprehension.
SQuAD
SQuAD (Stanford Question Answering Dataset) is a dataset of questions and answers. You can use it to train machine reading comprehension models. The dataset contains more than 100,000 question-answer pairs on a wide range of topics. SQuAD differs from earlier datasets.
It focuses on questions that require understanding the meaning of the passage rather than simply matching keywords.
As a result, it is an excellent resource for building and testing question answering models and other machine comprehension tasks. The questions in SQuAD are also written by people, which gives a high level of quality and consistency.
Overall, SQuAD is a valuable resource for NLP researchers and developers.
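To show what a question-answer record looks like in practice, the sketch below recovers an answer span from its passage using a character offset. The field names (`context`, `question`, `answers`, `text`, `answer_start`) follow the layout commonly seen in the Hugging Face copy of SQuAD, but treat them as an assumption. The record itself is invented for illustration and `extract_answers` is a hypothetical helper.

```python
# Hypothetical record; the passage and question are not taken from SQuAD.
context = "The Eiffel Tower is located in Paris, France."
record = {
    "context": context,
    "question": "Where is the Eiffel Tower located?",
    "answers": {
        "text": ["Paris, France"],
        # Character offset of the answer inside the context.
        "answer_start": [context.index("Paris, France")],
    },
}


def extract_answers(record):
    """Slice each answer out of the context and check it matches the label."""
    passage = record["context"]
    answers = record["answers"]
    spans = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        span = passage[start:start + len(text)]
        if span != text:
            raise ValueError(f"offset {start} does not point at {text!r}")
        spans.append(span)
    return spans


print(extract_answers(record))
```

Because each answer is a span of the passage, a model has to locate it in context, which fits the emphasis on comprehension over keyword matching.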
MNLI
MNLI, or Multi-Genre Natural Language Inference, is a dataset used to train and test machine learning models on natural language inference. The goal of MNLI is to determine whether a given statement is true (entailment), false (contradiction), or undetermined (neutral) in light of another statement.
MNLI differs from earlier datasets because it covers a wide range of text from many genres, from fiction to news and government reports. Because of this diversity, MNLI is a more representative sample of real-world text than many other natural language inference datasets.
With more than 400,000 examples, MNLI provides a substantial number of samples for training models. It also includes a label for each example to support model training.
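The three-way labeling can be illustrated with a few hand-written premise and hypothesis pairs. The sentences below are invented for this sketch. The integer-to-name mapping (0 = entailment, 1 = neutral, 2 = contradiction) is an assumption about how the labels are commonly encoded and should be checked against the dataset's own metadata before use.

```python
from collections import Counter

# Assumed label encoding; verify against the dataset you actually load.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Hypothetical examples, not taken from MNLI.
examples = [
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "A person is making music.",
        "label": 0,
    },
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "The man is a famous musician.",
        "label": 1,
    },
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "Nobody is on the stage.",
        "label": 2,
    },
]


def label_distribution(rows):
    """Count how many examples fall under each named label."""
    return Counter(LABEL_NAMES[row["label"]] for row in rows)


distribution = label_distribution(examples)
for name, count in distribution.items():
    print(f"{name}: {count}")
```

Checking the label distribution before training is a quick way to confirm that the classes are reasonably balanced.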
Final Thoughts
In conclusion, Hugging Face datasets are a valuable resource for NLP researchers and developers. Hugging Face provides a foundation for NLP development through its diverse collection of datasets.
Among the datasets covered here, we think the OpenWebText Corpus is the largest.
This high-quality dataset contains roughly 40GB of text. It is a valuable resource for training and evaluating NLP models. You can try using OpenWebText along with the others in your next projects.