Natural Language Processing (NLP) is seeing a new wave of advances, and Hugging Face datasets are at the forefront of that trend. In this article, we look at what Hugging Face datasets are.
We also look at how they are used to train and evaluate NLP models.
Hugging Face is a company that provides developers with a wide variety of datasets.
Whether you are a beginner or an experienced NLP practitioner, the data available on Hugging Face will be useful to you. Join us as we explore the field of NLP and learn what Hugging Face datasets can do.
First, what is NLP?
Natural Language Processing (NLP) is a branch of artificial intelligence. It studies how computers interact with human (natural) languages. NLP builds models that can understand and interpret human language, so that algorithms can perform tasks such as language translation, sentiment analysis, and text generation.
NLP is used in many areas, including customer service, marketing, and healthcare. The goal of NLP is to let computers interpret and understand human language, whether written or spoken, in a way that comes close to how people do.
Overview of Hugging Face
Hugging Face is a natural language processing (NLP) and machine learning technology company. It offers a wide range of resources that help developers advance the field of NLP. Its best-known product is the Transformers library.
The library is designed for natural language processing applications. It also provides pre-trained models for a variety of NLP tasks, such as language translation and question answering.
Beyond the Transformers library, Hugging Face offers a platform for sharing machine learning datasets. This gives developers quick access to high-quality datasets for training their models.
Hugging Face's mission is to make natural language processing (NLP) more accessible to developers.
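As a rough illustration of the dataset-sharing platform described above, the sketch below pulls a few rows from the Hugging Face Hub with the `datasets` library. It assumes `pip install datasets`, network access, and that SQuAD is published under the id `rajpurkar/squad`. The helper names `load_split` and `preview` are hypothetical and not part of the library.

```python
from itertools import islice


def load_split(name, split="train", streaming=True):
    """Load one split from the Hugging Face Hub.

    Assumption: the `datasets` package is installed and the Hub is reachable.
    The import is done lazily so that `preview` works without the package.
    """
    from datasets import load_dataset

    # streaming=True iterates over the data without downloading it all first
    return load_dataset(name, split=split, streaming=streaming)


def preview(records, n=3):
    """Return the first n rows of any iterable of dict-like records."""
    return list(islice(records, n))


# Example usage (needs network access, so it is left commented out):
# rows = preview(load_split("rajpurkar/squad", split="validation"), n=2)
# print(rows[0]["question"])
```

Streaming is used here only so that a quick look at a large dataset does not require a full download first.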
The most popular Hugging Face datasets
Cornell Movie-Dialogs Corpus
This is a popular dataset available through Hugging Face. The Cornell Movie-Dialogs Corpus consists of conversations extracted from movie scripts. Natural language processing (NLP) models can be trained on this large body of text.
The collection contains 220,579 conversational exchanges between 10,292 pairs of movie characters.
You can use this dataset for many kinds of NLP tasks. For example, you can develop language generation and question answering projects. You can also build dialogue systems, because the conversations cover a broad range of topics. The dataset has been used widely in research projects.
It is therefore a very useful resource for NLP researchers and developers.
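To make the dialogue-system use case above more concrete, here is a minimal sketch of one common preprocessing step: turning an ordered conversation into (prompt, response) training pairs. The function name `to_exchange_pairs` and the sample utterances are invented for illustration. They are not taken from the corpus or from any library.

```python
def to_exchange_pairs(conversation):
    """Pair each utterance with the one that follows it.

    `conversation` is an ordered list of utterance strings; the result is a
    list of (prompt, response) tuples suitable for training a reply model.
    """
    return list(zip(conversation, conversation[1:]))


# Invented sample conversation, for illustration only
sample = ["Where are you going?", "To the harbor.", "Can I come along?"]

for prompt, response in to_exchange_pairs(sample):
    print(f"{prompt!r} -> {response!r}")
```

A conversation of n utterances yields n - 1 pairs, so a single-line conversation contributes nothing.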
OpenWebText Corpus
The OpenWebText Corpus is a collection of web pages that you can find on the Hugging Face platform. The dataset covers a broad range of web content, such as articles, blogs, and forums, all of which were selected for their high quality.
The corpus is a valuable resource for training and evaluating NLP models. You can use it for tasks such as translation and summarization, and you can also use it for sentiment analysis, which makes it useful for many applications.
OpenWebText was assembled as an open recreation of OpenAI's WebText corpus, with the aim of providing a high-quality sample for training, and it is hosted on the Hugging Face Hub. It is a large dataset, with roughly 38GB of text data.
BERT
BERT (Bidirectional Encoder Representations from Transformers) is an NLP model. It is pre-trained and available on the Hugging Face platform. BERT was created by the Google AI Language team and was trained on a large text corpus to learn the context of words within a sentence.
Because it is a transformer-based model, it can process the entire input sequence at once instead of one word at a time. A transformer model uses attention mechanisms to interpret its input.
This is what allows BERT to understand the context of words within a sentence.
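The attention mechanism mentioned above can be sketched in a few lines of plain Python. This is a deliberately simplified, single-head version that assumes the queries, keys, and values are all the raw input vectors. A real BERT layer applies learned projection matrices and uses multiple heads, which are omitted here.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.

    Every position attends to every other position in one pass, which is
    what lets the model see the whole sequence at once.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of all value vectors
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]
        )
    return outputs


# Three toy 2-dimensional "word" vectors (invented values)
print(self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Each output vector is a weighted mix of all input vectors, so the representation of one word carries information from the rest of the sentence.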
You can use BERT for text classification, language understanding, named entity recognition, and coreference resolution, among other NLP applications. It is also useful for machine reading comprehension, although as an encoder-only model it is less suited to free-form text generation.
SQuAD
SQuAD (Stanford Question Answering Dataset) is a dataset of questions and answers. You can use it to train machine reading comprehension models. It contains more than 100,000 question-answer pairs on a variety of topics. SQuAD differs from earlier datasets.
It includes questions that require an understanding of the surrounding text rather than simple keyword matching.
As a result, it is an excellent resource for building and testing models for question answering and other machine comprehension tasks. The questions in SQuAD were also written by people, which gives the dataset a high level of quality and consistency.
Overall, SQuAD is a valuable resource for NLP researchers and developers.
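To show what a SQuAD-style record looks like in practice, the sketch below uses the field names of the Hugging Face `squad` dataset (`context`, `question`, and `answers` with `text` and `answer_start`). The record itself is invented, and `exact_match` is a simplified stand-in for the official metric, which additionally strips punctuation and articles before comparing.

```python
CONTEXT = "The capital of Hawaii is Honolulu."

# Invented record that follows the SQuAD field layout
record = {
    "context": CONTEXT,
    "question": "What is the capital of Hawaii?",
    "answers": {
        "text": ["Honolulu"],
        # Character offset of the answer inside the context
        "answer_start": [CONTEXT.index("Honolulu")],
    },
}


def answer_spans_match(rec):
    """Check that every answer text sits at its recorded character offset."""
    context = rec["context"]
    pairs = zip(rec["answers"]["text"], rec["answers"]["answer_start"])
    return all(context[start:start + len(text)] == text for text, start in pairs)


def exact_match(prediction, rec):
    """Simplified exact match: case- and whitespace-insensitive comparison."""
    def normalize(s):
        return " ".join(s.lower().split())

    return any(normalize(prediction) == normalize(t) for t in rec["answers"]["text"])


print(answer_spans_match(record))
print(exact_match("honolulu", record))
```

Because each answer is a span of the context, a model has to locate it in the passage rather than pick it from a fixed list of choices.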
MNLI
MNLI, or Multi-Genre Natural Language Inference, is a dataset used to train and test machine learning models for natural language understanding. The task in MNLI is to decide whether a given statement (the hypothesis) is true (entailment), false (contradiction), or undetermined (neutral) given another sentence (the premise).
MNLI differs from earlier datasets in that it covers a broad range of texts from many genres. These genres range from fiction to news pieces and government reports. Because of this variety, MNLI is a more representative sample of real-world text than many other natural language inference datasets.
With more than 400,000 examples in the dataset, MNLI provides a large number of instances for training models. Each example also carries a label that guides the models as they learn.
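The three-way labeling described above can be sketched as follows. The field names (`premise`, `hypothesis`, `label`) and the integer-to-name mapping follow the MNLI configuration of the GLUE dataset on the Hugging Face Hub, where hidden test labels are stored as -1. The example sentences and the helper `label_name` are invented for illustration.

```python
# Label order used by the MNLI configuration of GLUE on the Hugging Face Hub
MNLI_LABELS = ["entailment", "neutral", "contradiction"]


def label_name(example):
    """Map an example's integer label to its readable class name."""
    label = example["label"]
    if label == -1:  # hidden test-set labels are stored as -1
        return "unlabeled"
    return MNLI_LABELS[label]


# Invented premise/hypothesis pairs, for illustration only
examples = [
    {"premise": "A man is playing a guitar.", "hypothesis": "A person is making music.", "label": 0},
    {"premise": "A man is playing a guitar.", "hypothesis": "The man is on a stage.", "label": 1},
    {"premise": "A man is playing a guitar.", "hypothesis": "Nobody is playing an instrument.", "label": 2},
]

for ex in examples:
    print(f'{ex["hypothesis"]} -> {label_name(ex)}')
```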
Conclusion
In conclusion, Hugging Face datasets are a valuable resource for NLP researchers and developers. Hugging Face provides a framework for NLP development through its collection of libraries.
In our view, the standout Hugging Face dataset is the OpenWebText Corpus.
This high-quality corpus contains roughly 38GB of text data. It is a valuable resource for training and evaluating NLP models. You can use OpenWebText and the other datasets covered here in your upcoming projects.