I'm sure you have heard of artificial intelligence, along with terms such as machine learning and natural language processing (NLP).
That is especially likely if you work for a firm that deals with hundreds, if not thousands, of customers every day.
Analyzing data from social media posts, emails, chats, open-ended survey responses, and other sources is not a simple process, and it becomes even harder when it is left to people alone.
That is why many people are enthusiastic about what artificial intelligence can do for their daily work and for their businesses.
AI-powered text analysis uses a variety of methods or algorithms to interpret language naturally. One of them is topic analysis, which is used to automatically detect the subjects of texts.
Businesses can use topic analysis models to hand simple tasks over to machines instead of overloading employees with too much data.
Consider how much time your team could save and devote to more important work if a computer could sort through the endless lists of customer surveys or support tickets every morning.
In this guide, we will look at topic modelling and the different methods for doing it, and get some hands-on experience with it.
What is Topic Modelling?
Topic modelling is a type of text mining in which statistical machine learning techniques, typically unsupervised ones, are used to detect trends in a corpus or any large volume of unstructured text.
It can take your large collection of documents and use a similarity measure to group words into clusters of terms and so identify topics.
That sounds complex and difficult, so let's simplify the topic modelling process!
Imagine you are reading a newspaper with a set of colored highlighters in your hand.
Isn't that old-fashioned?
I realize that few people read printed newspapers these days; everything is digital, and highlighters are a thing of the past! Pretend you are your dad or mom!
So, as you read the newspaper, you highlight the important words.
One more assumption!
You use a different color to highlight the keywords of each theme. You sort the keywords into categories according to the color and topic assigned to them.
Each group of words marked with a particular color is a list of keywords for one topic. The number of different colors you chose is the number of themes.
This is topic modelling at its most basic. It helps with understanding, organizing, and summarizing large collections of text.
However, keep in mind that automated topic models need a lot of content to work well. If you have a short paper, you may want to go old school and use highlighters!
It also helps to spend some time getting to know the data. This gives you a basic sense of what the topic model should find.
For example, suppose the text is a diary about your current and past relationships. I would then expect my text-mining robot friend to come up with similar themes.
This helps you judge the quality of the topics you have identified and, if necessary, adjust the keyword sets.
Components of Topic Modelling
Probabilistic Model
Probabilistic models build random variables and probability distributions into their representation of an event or phenomenon.
A deterministic model gives a single possible outcome for an event, whereas a probabilistic model gives a probability distribution as its answer.
These models take into account the fact that we rarely have complete knowledge of a situation. There is almost always an element of randomness to consider.
For example, life insurance is based on the fact that we know we will die, but we do not know when. These models may be partly deterministic, partly random, or entirely random.
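To make the contrast concrete, here is a minimal sketch in Python. The function names and the toy word counts are invented for illustration: the deterministic function commits to one answer, while the probabilistic one returns a distribution over all the possible answers.

```python
from collections import Counter

def most_likely_word(word_counts):
    """Deterministic view: return a single outcome, the most frequent word."""
    return max(word_counts, key=word_counts.get)

def word_distribution(word_counts):
    """Probabilistic view: return a probability for every word."""
    total = sum(word_counts.values())
    return {word: count / total for word, count in word_counts.items()}

# Toy counts, made up for this example.
counts = Counter({"ball": 5, "team": 3, "bank": 2})

single_answer = most_likely_word(counts)
distribution = word_distribution(counts)

print(single_answer)
print(distribution)  # the probabilities sum to 1
```

Topic models take the second view: they never say "this document is about sports", only how probable each topic is.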
Information Retrieval
An information retrieval (IR) system is software that organizes, stores, retrieves, and evaluates information from document repositories, particularly textual information.
The technology helps users find the information they need, but it does not directly deliver answers to their questions. It reports the existence and location of documents that may contain the information required.
Relevant documents are those that meet the user's needs. A perfect IR system would return only relevant documents.
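As a rough illustration of how such a system can point to documents rather than answer questions, the sketch below ranks a handful of documents against a query using TF-IDF weights and cosine similarity. This is one common scoring scheme, chosen here as an assumption; the helper names and the sample documents are invented for the example.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(documents):
    """Compute a TF-IDF vector (stored as a dict) for every document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so a term found in every document still gets a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf

def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, documents):
    """Return (document index, score) pairs, best match first.

    Documents that share no term with the query are left out.
    """
    vectors, idf = build_index(documents)
    query_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(vectors)]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)

# Sample documents, made up for this example.
docs = [
    "the team won the football match",
    "the bank raised interest rates",
    "football fans cheered the team",
]
results = search("football team", docs)
print(results)
```

The result is a list of document positions and scores, not an answer: the banking document is not returned at all, and the user still has to open the matching documents to find what they need.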
Topic Coherence
Topic coherence scores a single topic by measuring the degree of semantic similarity between the highest-scoring terms in that topic. These metrics help distinguish topics that are semantically interpretable from topics that are merely artifacts of statistical inference.
A group of claims or facts is said to be coherent if they support one another.
A coherent set of facts can therefore be interpreted in a context that covers all or most of the facts. "The game is a team sport," "the game is played with a ball," and "the game demands great physical effort" together form an example of a coherent set of facts.
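One way to turn this idea into a number is to count how often a topic's top words appear in the same documents. The sketch below follows the style of the UMass coherence measure, which sums log((D(w_i, w_j) + 1) / D(w_i)) over pairs of top words, where D counts the documents containing the given words. The function name and the sample data are assumptions for illustration, and the code assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass-style coherence; top_words must be ordered from most to least probable."""
    doc_sets = [set(doc.lower().split()) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for words_in_doc in doc_sets
                   if all(w in words_in_doc for w in words))

    score = 0.0
    # combinations() keeps list order, so `earlier` is the higher-ranked word.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score

# Sample documents, made up for this example.
docs = [
    "the game is a team sport",
    "the game is played with a ball",
    "the team kicked the ball in the game",
    "the bank raised interest rates",
    "rates at the bank went up",
]

sports_score = umass_coherence(["game", "team", "ball"], docs)
mixed_score = umass_coherence(["game", "bank", "rates"], docs)
print(sports_score, mixed_score)
```

Scores closer to zero indicate words that tend to occur together, so the sports topic scores higher than the topic that mixes sports and banking words.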
Different Methods of Topic Modelling
This essential procedure can be carried out with a variety of algorithms or methods. Among them are:
- Latent Dirichlet Allocation (LDA)
- Non-Negative Matrix Factorization (NMF)
- Latent Semantic Analysis (LSA)
- Probabilistic Latent Semantic Analysis (pLSA)
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation, a statistical graphical model, is used to detect relationships between the many documents in a corpus.
Using the Variational Expectation Maximization (VEM) technique, a maximum likelihood estimate is obtained from the whole corpus of text.
Traditionally, the top few words from a bag of words are selected.
However, words selected this way say nothing about the meaning of the sentence they came from.
With LDA, each document is represented by a probability distribution over topics, and each topic is represented by a probability distribution over words.
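The two kinds of distribution fit together simply: the probability of seeing a word in a document is the sum, over topics, of the document's weight for that topic times the topic's probability for that word. The sketch below shows only this representation, not the fitting procedure. All names and numbers are invented for illustration; a fitted LDA model would estimate them from a corpus.

```python
# Toy distributions, made up for this example.
topic_word = {
    "sports":  {"ball": 0.5, "team": 0.4, "bank": 0.1},   # P(word | topic)
    "finance": {"ball": 0.1, "team": 0.1, "bank": 0.8},
}
doc_topic = {"sports": 0.7, "finance": 0.3}               # P(topic | document)

def word_probability(word, doc_topic, topic_word):
    """P(word | document) as a mixture over the document's topics."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_topic.items())

vocabulary = ["ball", "team", "bank"]
doc_word = {w: word_probability(w, doc_topic, topic_word) for w in vocabulary}
print(doc_word)
```

Because each row is a proper probability distribution, the resulting word probabilities for the document also sum to 1.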
Non-Negative Matrix Factorization (NMF)
Non-Negative Matrix Factorization is a widely used feature extraction technique.
NMF is useful when there are many attributes and the attributes are ambiguous or weak predictors. By combining attributes, NMF can produce meaningful patterns, topics, or themes.
NMF produces each feature as a linear combination of the original attribute set.
Each feature has a set of coefficients that measure the weight of each attribute on that feature. Each numerical attribute, and each value of each categorical attribute, has its own coefficient.
All of the coefficients are non-negative.
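As a sketch of how such a factorization can be computed, the code below uses the classic multiplicative update rules for the squared-error objective, written in plain Python so that it runs without extra libraries. This is one possible algorithm, not necessarily the one a given library uses; the function names, the toy matrix, and the defaults for `n_iter`, `seed`, and `eps` are assumptions made for this example.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (docs x terms) into w (docs x topics)
    and h (topics x terms) so that w @ h approximates v."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_terms)]
             for i in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, h

def frobenius_error(v, w, h):
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra)) ** 0.5

# Toy document-term counts, made up for this example.
terms = ["ball", "team", "bank", "rates"]
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
]
w, h = nmf(v, n_topics=2)
for topic_row in h:
    print([round(coefficient, 2) for coefficient in topic_row])
print(round(frobenius_error(v, w, h), 3))
```

Each row of `h` holds the coefficients of one feature (topic) over the original attributes (terms). Because the updates only multiply non-negative numbers, no coefficient can ever become negative.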
Latent Semantic Analysis
Latent semantic analysis is another unsupervised learning method, used to extract the relationships between words in a set of documents.
This helps us select the right documents. Its main job is to reduce the dimensionality of a huge corpus of text data.
The dimensions it discards hold redundant data that would otherwise act as background noise when extracting the insights we need.
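The dimensionality reduction in LSA is usually written as a truncated singular value decomposition. The notation below is an assumption of this sketch rather than something taken from the text: X is the term-document matrix and k is the number of latent dimensions that are kept.

```latex
X \;\approx\; X_k \;=\; U_k \, \Sigma_k \, V_k^{\top},
\qquad k \ll \operatorname{rank}(X)
```

Here the rows of U_k describe terms and the rows of V_k describe documents in the k-dimensional latent space, while Σ_k holds the k largest singular values. Everything associated with the smaller singular values is dropped, which is where the noise reduction comes from.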
Probabilistic Latent Semantic Analysis (pLSA)
Probabilistic latent semantic analysis (PLSA), sometimes called probabilistic latent semantic indexing (PLSI, especially in information retrieval circles), is a statistical technique for analyzing two-mode and co-occurrence data.
In effect, just as in latent semantic analysis, from which PLSA evolved, a low-dimensional representation of the observed variables can be derived in terms of their affinity to certain hidden variables.
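In the usual formulation, the observed variables are a document d and a word w, and the hidden variable is a topic z. Using that notation, which is an assumption of this sketch, the model for a document-word co-occurrence can be written as:

```latex
P(d, w) \;=\; \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
```

The number of topics z is much smaller than the number of documents or words, which is what makes the representation low-dimensional.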
Hands-On Topic Modelling with Python
Now I will walk you through a topic modelling task in the Python programming language, using a real-world example.
I will be modelling research topics. The dataset I am using comes from kaggle.com. You can easily get all the files I use in this task on this page.
Let's start topic modelling with Python by importing all the necessary libraries.
The next step is to read all the datasets I will use in this task.
Exploratory Data Analysis
EDA (Exploratory Data Analysis) is a statistical approach that relies on visual elements. It uses statistical summaries and graphical representations to discover trends and patterns and to check assumptions.
Before I start topic modelling, I will do some exploratory data analysis to see whether there are any patterns or relationships in the data.
Now we will look for null values in the test dataset.
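A null-value check can be sketched with the standard library alone. The in-memory CSV below stands in for a real file; its column names and rows are invented for this example and are not the actual dataset.

```python
import csv
import io

# Stand-in for a CSV file on disk; contents are made up for this sketch.
sample_csv = io.StringIO(
    "ID,TITLE,ABSTRACT\n"
    "1,Paper A,We study topic models.\n"
    "2,Paper B,\n"
    "3,,Matrix factorization for text.\n"
)

def count_missing(csv_file):
    """Count empty or absent cells per column in an open CSV file."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name)
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing

missing = count_missing(sample_csv)
print(missing)
```

If the data is already loaded into a pandas DataFrame named `df`, `df.isnull().sum()` gives the same kind of per-column count.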
Now I will plot a histogram and a boxplot to examine how the variables are distributed.
The number of characters in the training set abstracts varies widely.
In the training set, the minimum is 54 characters and the maximum is 4551 characters. The median is 1065 characters.
The test set has a narrower range than the training set, with a minimum of 46 characters and a maximum of 2841.
The test set has a median of 1058 characters, close to that of the training set.
The word count in the training set follows the same pattern as the character count.
It ranges from a minimum of 8 words to a maximum of 665 words, with a median of 153 words.
In the test set, abstracts range from a minimum of 7 words to a maximum of 452 words.
The median here is 153, the same as the median in the training set.
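Summary figures like these can be computed in a few lines. The sketch below uses three made-up abstracts rather than the real dataset, and the function name is an assumption for this example.

```python
from statistics import median

def length_summary(texts):
    """Return (min, max, median) of character counts and of word counts."""
    char_counts = [len(text) for text in texts]
    word_counts = [len(text.split()) for text in texts]
    return {
        "chars": (min(char_counts), max(char_counts), median(char_counts)),
        "words": (min(word_counts), max(word_counts), median(word_counts)),
    }

# Placeholder abstracts, invented for this example.
abstracts = [
    "Topic models find themes in text.",
    "We factorize a term matrix.",
    "LDA represents each document as a mixture of topics.",
]
summary = length_summary(abstracts)
print(summary)
```

Running the same function on the training and test abstracts separately makes it easy to compare the two sets side by side.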
Using Tags for Topic Modelling
There are several strategies for topic modelling. I will use tags in this task, so let's see how to do that by exploring the tags.
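A first look at tags might count how often each tag occurs and which words are most common under each one. The sketch below is an illustration under assumptions, not the original walkthrough: the rows, the tag names, the function names, and the tiny stopword list are all invented.

```python
from collections import Counter, defaultdict

# Invented rows: (abstract text, list of tags attached to it).
rows = [
    ("neural networks for image data", ["Computer Science"]),
    ("bayesian inference for neural models", ["Computer Science", "Statistics"]),
    ("bayesian estimation of variance", ["Statistics"]),
]

def tag_frequencies(rows):
    """Count how many abstracts carry each tag."""
    return Counter(tag for _, tags in rows for tag in tags)

def top_words_per_tag(rows, n=1, stopwords=frozenset({"for", "of"})):
    """Return the n most frequent non-stopword tokens under each tag."""
    words = defaultdict(Counter)
    for text, tags in rows:
        tokens = [w for w in text.lower().split() if w not in stopwords]
        for tag in tags:
            words[tag].update(tokens)
    return {tag: [word for word, _ in counter.most_common(n)]
            for tag, counter in words.items()}

frequencies = tag_frequencies(rows)
keywords = top_words_per_tag(rows, n=1)
print(frequencies)
print(keywords)
```

The per-tag keyword lists play the same role as the colored highlighter groups: each tag ends up with the words that characterize it.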
Applications of Topic Modelling
- In text summarization, it can be used to identify the topic of a document or book.
- It can be used to reduce bias when scoring candidates' exams.
- Topic modelling can be used to build semantic relationships between words in graph-based models.
- It can improve customer service by detecting and responding to keywords in a client's query. Customers will trust you more because you gave them the help they needed at the right time and without causing them any trouble. As a result, customer loyalty rises and the company's value grows.
Conclusion
Topic modelling is a type of statistical modelling used in machine learning and natural language processing to uncover the abstract "topics" present in a collection of texts.
It is a widely used text mining technique for discovering hidden semantic patterns in a body of text.