Natural Language Processing (NLP) is going through a new wave of growth, and Hugging Face datasets are at the forefront of it. In this article, we look at why Hugging Face datasets matter.
We will also see how they can be used to train and evaluate NLP models.
Hugging Face is a company that provides developers with a wide variety of datasets.
Whether you are a beginner or an experienced NLP practitioner, the datasets available on Hugging Face will be useful to you. Join us as we explore the field of NLP and learn what Hugging Face datasets make possible.
First, What Is NLP?
Natural Language Processing (NLP) is a branch of artificial intelligence. It studies how computers interact with human (natural) languages. NLP involves building models that can understand and interpret human language, so that algorithms can handle tasks such as language translation, sentiment analysis, and text generation.
NLP is used in many fields, including customer service, marketing, and healthcare. The goal of NLP is to let computers interpret and understand human language, written or spoken, in a way comparable to how people do.
Overview of Hugging Face
Hugging Face is a natural language processing (NLP) and machine learning technology company. It offers a wide range of resources that help developers advance the field of NLP. Its best-known product is the Transformers library.
The library is designed for natural language processing applications and provides pre-trained models for a variety of NLP tasks, such as language translation and question answering.
In addition to the Transformers library, Hugging Face provides a platform for sharing machine learning datasets. This gives developers quick access to high-quality data for training their models.
Hugging Face's mission is to make natural language processing (NLP) more accessible to developers.
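To make the idea of a dataset-sharing platform concrete, here is a minimal Python sketch. It assumes the third-party `datasets` package is installed; `load_hub_split` and `summarize_rows` are illustrative helper names, not part of any library, and the sample rows are made up. The download helper is defined but not called, so the sketch runs without network access.

```python
from collections import Counter


def summarize_rows(rows):
    """Return the row count and how often each column name appears."""
    columns = Counter()
    for row in rows:
        columns.update(row.keys())
    return {"num_rows": len(rows), "columns": dict(columns)}


def load_hub_split(name, split):
    """Download one split of a dataset from the Hugging Face Hub.

    Assumes the third-party `datasets` package is installed
    (pip install datasets) and that network access is available.
    The import is local so the rest of the sketch runs without it.
    """
    from datasets import load_dataset

    return load_dataset(name, split=split)


# Hypothetical rows, shaped like question-answering records.
sample_rows = [
    {"question": "Who wrote Hamlet?", "context": "Hamlet is a play by Shakespeare."},
    {"question": "What is NLP?", "context": "NLP studies human language."},
]
summary = summarize_rows(sample_rows)
print(summary)
```

With the package installed, a call such as `summarize_rows(load_hub_split("squad", "validation"))` should give the same kind of summary for a real dataset. Identifiers on the Hub can change, so check the dataset's page for the current name.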
Most Popular Hugging Face Datasets
Cornell Movie-Dialogs Corpus
This is a well-known dataset available on Hugging Face. The Cornell Movie-Dialogs Corpus contains dialogue extracted from movie scripts. This large body of text can be used to train natural language processing (NLP) models.
The collection includes 220,579 conversational exchanges between 10,292 pairs of movie characters.
You can use this dataset for a variety of NLP tasks. For example, you can build language generation and question answering systems. You can also build dialogue systems, because the conversations cover a wide range of topics. The corpus has also been used widely in research projects.
This makes it a very useful resource for NLP researchers and developers.
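One common way to prepare conversational data for a dialogue system is to turn each conversation into (context, response) training pairs. The sketch below illustrates that idea only; the conversation is invented, and `build_pairs` and `context_size` are hypothetical names rather than anything defined by the corpus.

```python
def build_pairs(conversation, context_size=2):
    """Turn an ordered list of utterances into (context, response) pairs.

    Each utterance after the first becomes a response, and up to
    `context_size` preceding utterances form its context.
    """
    pairs = []
    for i in range(1, len(conversation)):
        start = max(0, i - context_size)
        pairs.append((conversation[start:i], conversation[i]))
    return pairs


# Invented exchange standing in for one conversation from the corpus.
conversation = [
    "Where are you going?",
    "To the station.",
    "Can I come with you?",
    "Sure, hurry up.",
]
pairs = build_pairs(conversation)
for context, response in pairs:
    print(context, "->", response)
```

A larger `context_size` gives a model more history per example at the cost of longer inputs.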
OpenWebText Corpus
The OpenWebText Corpus is a collection of web pages that you can find on the Hugging Face platform. The dataset covers a broad range of web pages, such as news articles, blog posts, and forum discussions, all of which were selected for quality.
The dataset is particularly valuable for training and evaluating NLP models. Because it is raw, unlabeled text, it is typically used to pre-train language models that are later applied to tasks such as translation, summarization, and sentiment analysis, which makes it a strong asset for many applications.
OpenWebText was created by Aaron Gokaslan and Vanya Cohen as an open-source reproduction of OpenAI's WebText dataset, with the aim of providing a high-quality corpus for training. It is a large dataset, with about 40 GB of text drawn from roughly 8 million documents.
BERT
BERT (Bidirectional Encoder Representations from Transformers) is an NLP model rather than a dataset. It is pre-trained and available on the Hugging Face platform. BERT was created by the Google AI Language team and was trained on a large volume of text so that it can capture the context of words in a sentence.
Because BERT is a transformer-based model, it can process an entire input sequence at once rather than one word at a time. Transformer-based models use attention mechanisms to interpret sequential input.
This property lets BERT understand each word in light of the words on both sides of it.
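The attention idea can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual implementation: real transformer layers apply learned query, key, and value projections and use several attention heads, all of which are omitted here, and the token embeddings are made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.

    Returns the new vector for every position and the attention
    weights each position assigns to all positions in the sequence.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Score the query against every position, earlier and later alike.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all input vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)]
        )
    return outputs, weights


# Toy 2-dimensional embeddings for a three-token input (made-up values).
tokens = ["river", "bank", "water"]
embeddings = [[1.0, 0.0], [0.7, 0.7], [0.9, 0.1]]
outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Every row of weights covers the whole sequence, so the representation of "bank" draws on both the token before it and the token after it. That is the sense in which the whole input is processed at once.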
You can use BERT for text classification, language understanding, named entity recognition, and coreference resolution, among other NLP applications. It is also useful for machine reading comprehension; as an encoder-only model, it is less suited to open-ended text generation.
SQuAD
SQuAD (Stanford Question Answering Dataset) is a collection of questions and answers. You can use it to train machine reading comprehension models. The dataset contains more than 100,000 question-answer pairs on a variety of topics. SQuAD differs from earlier datasets.
It focuses on questions that require understanding the context of a passage rather than simply matching keywords.
As a result, it is an excellent resource for building and testing models for question answering and other machine comprehension tasks. The questions in SQuAD are also written by people, which gives the dataset a high level of quality and accuracy.
Overall, SQuAD is an important resource for NLP researchers and developers.
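Question answering models trained on SQuAD are usually scored with exact match and token-level F1. The sketch below is a simplified version of those two metrics, written from memory rather than copied from the official evaluation script. The example record is invented, and its field names (`context`, `question`, `answers`) are assumed to mirror the layout of the dataset as published on Hugging Face.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """True when both answers are identical after normalization."""
    return normalize(prediction) == normalize(truth)


def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented record; the answer is a span of the context at a character offset.
example = {
    "context": "The Eiffel Tower was completed in 1889 in Paris.",
    "question": "When was the Eiffel Tower completed?",
    "answers": {"text": ["1889"], "answer_start": [34]},
}
gold = example["answers"]["text"][0]
print(exact_match("The 1889", gold), round(f1_score("in 1889", gold), 3))
```

Exact match rewards only a fully correct span, while F1 gives partial credit when a prediction overlaps the reference answer.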
MNLI
MNLI, or Multi-Genre Natural Language Inference, is a dataset used to train and test machine learning models for natural language understanding. The task in MNLI is to decide whether a given statement (the hypothesis) is true, false, or neutral with respect to another statement (the premise). These three outcomes are usually called entailment, contradiction, and neutral.
MNLI differs from earlier datasets because it contains text from many genres, ranging from fiction to news articles and government documents. Because of this variety, MNLI is a more representative sample of real-world text. It is also one of the larger natural language inference datasets.
With more than 400,000 examples, MNLI provides a large number of examples for training models. Each example is also annotated with a label that models learn from.
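Because MNLI spans several genres, it is natural to report accuracy per genre as well as overall. The sketch below shows one way to do that. The examples and predictions are invented, the field names are hypothetical, and the label order in `LABELS` is an assumption that should be checked against the dataset card before use.

```python
from collections import defaultdict

# Assumed id-to-name order; verify against the dataset card.
LABELS = ("entailment", "neutral", "contradiction")

# Invented premise/hypothesis pairs in the style of the dataset.
examples = [
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A person is performing music.",
     "label": 0, "genre": "fiction"},
    {"premise": "The agency published its annual report in March.",
     "hypothesis": "The agency never publishes reports.",
     "label": 2, "genre": "government"},
    {"premise": "She walked into the room and sat down.",
     "hypothesis": "She was tired after a long day.",
     "label": 1, "genre": "fiction"},
    {"premise": "The committee approved the budget.",
     "hypothesis": "The budget was approved.",
     "label": 0, "genre": "government"},
]


def accuracy(predictions, rows):
    """Fraction of rows whose predicted label id matches the gold label."""
    correct = sum(p == row["label"] for p, row in zip(predictions, rows))
    return correct / len(rows)


def accuracy_by_genre(predictions, rows):
    """Accuracy computed separately for each genre."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, row in zip(predictions, rows):
        totals[row["genre"]] += 1
        hits[row["genre"]] += int(p == row["label"])
    return {genre: hits[genre] / totals[genre] for genre in totals}


# Hypothetical model output: one label id per example.
predictions = [0, 2, 0, 0]
overall = accuracy(predictions, examples)
per_genre = accuracy_by_genre(predictions, examples)
print(overall, per_genre)
print([LABELS[p] for p in predictions])
```

A gap between genres in a report like this can reveal a model that has learned one style of text better than another.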
Final Thoughts
In conclusion, Hugging Face datasets are an invaluable resource for NLP researchers and developers. By offering a diverse collection of datasets, Hugging Face provides a foundation for progress in NLP.
Of the datasets covered here, OpenWebText is the largest.
It contains about 40 GB of text and is a valuable resource for training and evaluating NLP models. Consider trying OpenWebText and the other datasets in your future projects.