Natural Language Processing (NLP) is driving a new wave of innovation, and Hugging Face datasets are at the forefront of that trend. In this article, we look at why Hugging Face datasets matter.
We also show how they can be used to train and evaluate NLP models.
Hugging Face is a company that gives developers access to a wide variety of datasets.
Whether you are a beginner or an experienced NLP practitioner, the datasets available on Hugging Face will be useful to you. Join us as we explore the field of NLP and learn what Hugging Face datasets can do.
First, What Is NLP?
Natural Language Processing (NLP) is a branch of artificial intelligence. It studies how computers interact with human (natural) languages. NLP involves building models that can understand and interpret human language, so that algorithms can perform tasks such as machine translation, sentiment analysis, and text generation.
NLP is used in many areas, including customer service, marketing, and healthcare. Its goal is to let computers interpret and understand human language, whether written or spoken, in a way that approaches human understanding.
Overview of Hugging Face
Hugging Face is a natural language processing (NLP) and machine learning technology company. It provides a wide range of tools that help developers advance the field of NLP. Its best-known product is the Transformers library.
The library is designed for natural language processing and provides pre-trained models for a range of NLP tasks, such as machine translation and question answering.
In addition to the Transformers library, Hugging Face provides a platform for sharing machine learning datasets. This makes it easy for developers to quickly find high-quality datasets for training their models.
Hugging Face's goal is to make natural language processing (NLP) more accessible to developers.
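As a rough illustration of how shared datasets are typically pulled into a project, the sketch below wraps `load_dataset` from the open-source `datasets` library. The helper name `peek`, its `loader` parameter, and the dataset ID in the usage comment are assumptions made for this example, not something prescribed by Hugging Face.

```python
from itertools import islice


def peek(dataset_name, split="train", n=3, loader=None, **kwargs):
    """Return the first n examples of a dataset split as a list."""
    if loader is None:
        # Imported lazily so the helper can be defined without the library installed.
        from datasets import load_dataset  # pip install datasets
        loader = load_dataset
    dataset = loader(dataset_name, split=split, **kwargs)
    # islice works for both in-memory and streaming datasets.
    return list(islice(dataset, n))


# Example usage (needs network access; the dataset ID is an assumption):
# for example in peek("squad", split="validation", n=2, streaming=True):
#     print(example["question"])
```

Passing `streaming=True` through to `load_dataset` avoids downloading a whole split just to inspect a few rows. The optional `loader` argument exists only so the helper can be exercised with a stand-in function when the library or the network is unavailable.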
The Most Popular Hugging Face Datasets
Cornell Movie-Dialogs Corpus
This is a well-known dataset available on Hugging Face. The Cornell Movie-Dialogs Corpus contains conversations extracted from movie scripts. Natural language processing (NLP) models can be trained on this large collection of text data.
The collection includes 220,579 conversational exchanges between 10,292 pairs of movie characters.
You can use this dataset for a variety of NLP tasks. For example, you can build language generation and question-answering projects, or dialogue systems, since the conversations cover such a wide range of topics. The dataset has also been widely used in research projects.
It is therefore a very useful resource for NLP researchers and developers.
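To turn movie conversations into training data for a dialogue system, one common approach is to pair each utterance with the reply that follows it. The sketch below assumes the raw `movie_lines.txt` layout from the corpus's original distribution, with fields separated by ` +++$+++ `. The sample lines and helper names are made up for illustration, and the version hosted on the Hub may expose a different schema.

```python
SEP = " +++$+++ "  # field separator assumed from the raw corpus files


def parse_line(raw):
    """Split one raw line into its ID, speaker name, and utterance text."""
    line_id, _character_id, _movie_id, speaker, text = raw.rstrip("\n").split(SEP, 4)
    return {"id": line_id, "speaker": speaker, "text": text}


def to_exchanges(utterances):
    """Pair each utterance with the reply that follows it."""
    return list(zip(utterances, utterances[1:]))


# Hypothetical lines, already in conversation order.
raw_lines = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ANNA +++$+++ Did you see the film?",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BEN +++$+++ Twice. It was great.",
    "L3 +++$+++ u0 +++$+++ m0 +++$+++ ANNA +++$+++ Then you can explain the ending.",
]
conversation = [parse_line(line)["text"] for line in raw_lines]
exchanges = to_exchanges(conversation)

for prompt, reply in exchanges:
    print(f"{prompt!r} -> {reply!r}")
```

Each resulting `(prompt, reply)` tuple can serve as one input/target pair when training a response-generation model.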
OpenWebText Corpus
The OpenWebText Corpus is a collection of web pages available on the Hugging Face platform. It covers a wide range of web content, such as articles, blogs, and forums, all selected for quality.
The dataset is especially valuable for training and evaluating NLP models. You can use it for tasks such as translation and summarization, or for sentiment analysis, which makes it a significant asset for many applications.
OpenWebText was assembled as an open reproduction of OpenAI's WebText dataset, with the aim of providing a high-quality training sample. It is a large dataset, with about 38GB of text data.
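A corpus of this size is usually processed as a stream rather than loaded into memory at once. The sketch below shows one way to do that with a generator that drops very short pages and exact duplicates. It is not the pipeline used to build OpenWebText; the `min_words` threshold and the function name `clean_stream` are assumptions chosen for illustration.

```python
import hashlib


def clean_stream(documents, min_words=5):
    """Lazily yield whitespace-normalized documents, skipping short or repeated ones."""
    seen = set()
    for text in documents:
        normalized = " ".join(text.split())
        if len(normalized.split()) < min_words:
            continue  # too short to be useful training text
        digest = hashlib.sha1(normalized.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield normalized


# Hypothetical pages standing in for a stream of web documents.
pages = [
    "An article about open datasets for NLP.",
    "Too short.",
    "An  article about open datasets for NLP.",
    "A forum thread comparing tokenizers in detail.",
]
kept = list(clean_stream(pages))
print(len(kept), "of", len(pages), "pages kept")
```

Because `clean_stream` is a generator, it only holds the set of hashes in memory, not the documents themselves, so the same function can be fed a streaming dataset iterator.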
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained NLP model rather than a dataset, and it is available on the Hugging Face platform. BERT was developed by Google's AI Language team and is trained on a large text corpus to capture the context of words in a sentence.
Because BERT is a transformer-based model, it can process the whole input sequence at once rather than one word at a time. Transformer-based models use attention mechanisms to interpret the input sequence.
This property is what lets BERT capture the context of words in a sentence.
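To make the idea of attention more concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation in transformer models. It leaves out everything else BERT uses (learned projections, multiple heads, positional embeddings), and the tiny vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy example: three token vectors attend to one another (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = attention(tokens, tokens, tokens)
print(contextual)
```

Every output vector depends on all input tokens at once, which is why a transformer does not need to read the sequence one word at a time.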
You can use BERT for text classification, language understanding, named entity recognition, and coreference resolution, among other NLP applications. It is also useful for text generation and machine reading comprehension.
SQuAD
SQuAD (the Stanford Question Answering Dataset) is a collection of questions and answers. You can use it to train machine reading comprehension models. The dataset contains more than 100,000 questions and answers on a variety of topics.
SQuAD differs from earlier datasets in that it emphasizes questions that require understanding the context of a passage rather than simply matching keywords.
As a result, it is an excellent resource for building and evaluating question-answering models and other machine comprehension tasks. The questions in SQuAD are also written by people, which gives the dataset a high level of quality and consistency.
Overall, SQuAD is an important resource for NLP researchers and developers.
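The sketch below shows what a SQuAD-style record looks like and how a predicted answer can be checked against it. The field names (`context`, `question`, `answers` with `text` and `answer_start`) follow the layout of the dataset as published on the Hub, to our knowledge. The record itself is invented, and `normalize` is a simplified stand-in for the official evaluation script's normalization, not a copy of it.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the prediction matches any reference answer after normalization."""
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)


def answer_span(example, index=0):
    """Recover an answer from the context using its character offset."""
    start = example["answers"]["answer_start"][index]
    length = len(example["answers"]["text"][index])
    return example["context"][start:start + length]


# Hypothetical record in the assumed SQuAD layout.
context = "The Transformers library provides pre-trained models for many NLP tasks."
example = {
    "context": context,
    "question": "What does the Transformers library provide?",
    "answers": {
        "text": ["pre-trained models"],
        "answer_start": [context.index("pre-trained models")],
    },
}

print(answer_span(example))
print(exact_match("The pre-trained models.", example["answers"]["text"]))
```

Because answers are stored as character offsets into the passage, a model has to locate the answer in context rather than retrieve it from a fixed list.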
MNLI
MNLI, or Multi-Genre Natural Language Inference, is a dataset used to train and evaluate machine learning models for natural language inference. The task in MNLI is to decide whether a given statement (the hypothesis) is true (entailment), false (contradiction), or neutral in light of another statement (the premise).
MNLI differs from earlier datasets because it includes a wide range of text from many genres, from fiction to news articles and government documents. Because of this variety, MNLI is a more representative sample of real-world text than natural language inference datasets drawn from a single genre.
With more than 400,000 examples, MNLI provides a substantial amount of training data. Each example is also annotated with a label that models learn from.
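Because MNLI spans several genres, it is often useful to report accuracy per genre as well as overall. The sketch below assumes records with `premise`, `hypothesis`, `genre`, and an integer `label` (0 = entailment, 1 = neutral, 2 = contradiction), which matches the Hub version of the dataset to our knowledge. The records and predictions are invented.

```python
from collections import defaultdict

# Assumed label order; verify against the dataset's own metadata before relying on it.
LABELS = ("entailment", "neutral", "contradiction")


def accuracy_by_genre(examples, predictions):
    """Return {genre: fraction of examples whose predicted label is correct}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        totals[example["genre"]] += 1
        hits[example["genre"]] += int(predicted == example["label"])
    return {genre: hits[genre] / totals[genre] for genre in totals}


# Hypothetical records in the assumed MNLI layout.
examples = [
    {"premise": "She opened the door.", "hypothesis": "A door was opened.",
     "genre": "fiction", "label": 0},
    {"premise": "He never left the house.", "hypothesis": "He went to the market.",
     "genre": "fiction", "label": 2},
    {"premise": "The agency published its report.", "hypothesis": "The report is long.",
     "genre": "government", "label": 1},
]
predictions = [0, 1, 1]  # made-up model outputs

scores = accuracy_by_genre(examples, predictions)
for genre, score in scores.items():
    print(f"{genre}: {score:.2f}")
print("first gold label:", LABELS[examples[0]["label"]])
```

A large gap between genres in this breakdown is a hint that a model has overfit to one style of text.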
Final Thoughts
In conclusion, Hugging Face datasets are a very valuable resource for NLP researchers and developers. Hugging Face provides a foundation for NLP development through its diverse collection of datasets.
In our view, the best dataset on Hugging Face is the OpenWebText Corpus.
This high-quality dataset contains about 38GB of text data and is a valuable resource for training and evaluating NLP models. Try using OpenWebText and the other datasets in your next projects.