Every machine learning project depends on a good dataset. It is this large body of data that lets you train and validate your ML model. As a result, a large part of the work in an ML project is finding the right dataset for your needs. It is not always possible to find one that matches what you want, though, because many files that look interesting turn out not to be.
It can be tedious to spend time downloading dataset after dataset until you reach the ideal one. With that in mind, we have gathered some options that look interesting and that can help you develop your ML project. Note that some are intended for personal rather than commercial use, so treat these options as a way to gain experience across the ML field.
Dataset Basics
Before we talk about datasets, we should define some terms. Artificial Intelligence projects, and Machine Learning projects in particular, need a large amount of data, which is used to train the algorithm. This data is gathered into a database, known as a dataset, which is what the algorithm learns from.
With this data the algorithm is trained, and also tested, so that it can find patterns, establish relationships, and make decisions on its own. Without training, Machine Learning algorithms cannot perform any task. The better the training data, therefore, the better the model will perform. For a dataset to be useful to a project, it is not only about quantity but also about how the data is classified.
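To make the idea of training and testing concrete, here is a minimal sketch of splitting a dataset into a training part and a held-out test part. The function name `train_test_split`, the 80/20 ratio, and the made-up examples are assumptions chosen for illustration, not something prescribed by any of the datasets below.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Ten made-up labeled examples in the form (feature, label).
examples = [(i, i % 2) for i in range(10)]
train, test = train_test_split(examples, test_fraction=0.2, seed=42)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the split reproducible, which makes it easier to compare models trained on the same data.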
Ideally, the data should be well labeled. Take chatbots as an example: the language input matters, but careful syntactic analysis is also needed so that the resulting algorithm can understand when the other party uses slang. Only then can the virtual assistant give a response that matches what the user asked for.
Datasets can be built from surveys, user purchase data, reviews left on services, and many other sources that yield useful information organized into columns and rows in a CSV file.
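As a rough illustration of the columns-and-rows layout, the sketch below parses a small CSV with Python's standard `csv` module. The column names (`user_id`, `product`, `rating`) and the values are invented for the example; a real file would be opened from disk rather than read from a string.

```python
import csv
import io

# Made-up survey data standing in for a real CSV file.
SAMPLE_CSV = """user_id,product,rating
1,kettle,5
2,kettle,3
3,toaster,4
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(SAMPLE_CSV)
# Every CSV value arrives as a string, so convert before doing arithmetic.
ratings = [int(row["rating"]) for row in rows]
average = sum(ratings) / len(ratings)
print(len(rows), average)  # 3 4.0
```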
Before you start searching for the ideal dataset, it is important to know the goal of your project, especially if it belongs to a specific domain such as weather, finance, or health. This determines where you should look for the dataset.
Datasets for ML
Chatbot Training
An effective chatbot needs a large amount of training data in order to resolve user queries quickly without human intervention. The main bottleneck in chatbot development, however, is obtaining realistic, task-oriented conversation data to train these Machine Learning based systems.
A conversation dataset collects data in question-and-answer format. It is well suited to training chatbots that give users automatic answers. Without this data, a chatbot cannot resolve or answer user queries quickly without human intervention.
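The question-and-answer format can be put to work even without a trained model. The sketch below is a deliberately simple retrieval approach: it returns the stored answer whose question shares the most words with the user's query (Jaccard similarity). The pairs, the `answer` function, and the 0.2 threshold are all hypothetical; production chatbots typically rely on trained models rather than word overlap.

```python
import re

# Hypothetical question/answer pairs; a real dataset would hold thousands.
QA_PAIRS = [
    ("What are your opening hours?", "We are open from 9am to 6pm, Monday to Friday."),
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
    ("Where is my order?", "You can track your order from the 'My orders' page."),
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def answer(query, pairs=QA_PAIRS, threshold=0.2):
    """Return the reply of the most similar stored question, or None."""
    q = tokenize(query)
    best_score, best_answer = 0.0, None
    for question, reply in pairs:
        score = jaccard(q, tokenize(question))
        if score > best_score:
            best_score, best_answer = score, reply
    return best_answer if best_score >= threshold else None

print(answer("how can I reset my password"))
# Use the 'Forgot password' link on the login page.
```

Returning `None` below the threshold gives the caller a clear signal to hand the conversation over to a human.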
Using these datasets, businesses can build a tool that gives customers quick answers 24/7 and costs far less than a team of people handling customer support.
1. Question-Answer Dataset
This dataset provides a set of Wikipedia articles, questions, and their manually written answers. It was collected between 2008 and 2010 for use in academic research.
2. Language Data
Language Data is a database maintained by Yahoo with information generated by some of the company's services, such as Yahoo! Answers, which works as an open community where users post questions and answers.
3. WikiQA
The WikiQA corpus also contains a set of questions and answers. The questions come from Bing, while the answers link to a Wikipedia page that has the potential to resolve the original question.
In total, the dataset holds more than 3,000 questions and a set of 29,258 sentences, about 1,400 of which are labeled as answers to their corresponding question.
Government Data
Datasets produced by governments provide demographic data, which is good input for projects that aim to understand social trends, shape public policy, and improve society. This can be useful for political campaigns, targeted advertising, or market analysis.
These datasets usually contain anonymized data, so although models have access to the raw data, no individual's privacy is violated.
4. Data.gov
Launched in 2009, Data.gov is the United States government's open data source. Its catalog is impressive: more than 218,000 datasets that can be filtered by format, tags, types, and topics.
5. EU Open Data Portal
The EU Open Data Portal provides access to open data shared by the institutions of the European Union. This data may be used for both commercial and non-commercial purposes. Users will find more than 15.5 thousand datasets covering topics such as health, energy, environment, culture, and education.
Health Data
Because of the ongoing global health crisis, datasets produced by health organizations are essential for developing effective solutions that save lives. These datasets can help identify risk factors, model disease transmission patterns, and speed up diagnosis.
They include health records, patient demographics, disease prevalence, medication use, nutritional values, and much more.
6. Global Health Observatory
This dataset is an initiative of the World Health Organization (WHO). It provides public data on different areas of health, organized by themes such as health systems, tobacco control, maternity, HIV/AIDS, and so on. There is also an option to view data on COVID-19.
7. CORD-19
CORD-19 is a corpus of academic publications on COVID-19 and other articles about the novel coronavirus. It is an open dataset intended to generate new insights into COVID-19.
Economic Data
Datasets about the financial landscape usually hold a large amount of information, because they tend to be collected over long periods. They are well suited to producing economic forecasts or identifying investment trends.
With the right financial datasets, a Machine Learning model may be able to predict the behavior of a particular asset. That is why the financial sector works so hard to build successful ML models: anything that predicts well has the potential to generate millions of dollars. Machine Learning is already being used to predict citizens' behavior, which influences how policymakers do their work.
8. International Monetary Fund
The IMF dataset offers a range of economic and financial indicators, statistics on member countries, and other data on lending and exchange rates.
9. World Bank
The World Bank repository contains various datasets with economic information from different countries. There are more than 17,000 datasets, grouped by continent.
Product and Service Reviews
Sentiment analysis has found applications in many fields and now helps businesses gauge and learn from their customers properly. It is increasingly used for social media monitoring, brand monitoring, voice of the customer (VoC), customer service, and market research.
Sentiment analysis uses NLP (natural language processing) methods and algorithms that are rule-based, hybrid, or reliant on machine learning techniques that learn from datasets.
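To show what the rule-based end of that spectrum can look like, here is a minimal sketch that scores a text against a tiny hand-written word list and flips the polarity after a simple negation. The word lists and the `sentiment` function are assumptions made up for this example; real systems use much larger lexicons or trained models.

```python
import re

# Tiny hand-written lexicon; real rule-based systems use far larger ones.
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}
NEGATIONS = {"not", "never", "no"}

def sentiment(text):
    """Return 'positive', 'negative' or 'neutral' from lexicon word counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        # Flip polarity when the previous word is a simple negation.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Great product, fast delivery"))   # positive
print(sentiment("Not good, the battery is poor"))  # negative
print(sentiment("It arrived on Tuesday"))          # neutral
```

A machine learning approach would replace the fixed word lists with weights learned from labeled reviews, which is where the datasets below come in.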
The data needed for sentiment analysis must be specialized and is required in large quantities. The hardest part of training a sentiment analysis system is not obtaining data in large amounts but finding relevant datasets. These datasets need to cover a broad range of sentiment analysis applications and use cases.
10. Amazon Reviews
This dataset contains about 35 million Amazon reviews, spanning 18 years of collected information. It covers products, users, and review content.
11. Yelp Reviews
Yelp also offers a dataset based on information collected from its service. It contains more than 8 million reviews, 1 million tips, and nearly 1.5 million business attributes, such as opening hours and availability.
12. IMDB Reviews
This dataset contains more than 25 thousand movie reviews for training and another 25 thousand for testing, taken from IMDB, a site that specializes in movie ratings. It also provides unlabeled data as an extra.
Datasets for First Steps in ML
13. Wine Quality Dataset
This dataset provides information about wine, both red and white variants of "Vinho Verde", produced in the north of Portugal. The goal is to determine wine quality based on physicochemical tests. It is an interesting choice for anyone who wants to practice building a prediction system.
14. Titanic Dataset
This dataset contains data on 887 real passengers of the Titanic, with columns describing whether each one survived, their age, passenger class, sex, and the fare they paid. It was part of a challenge launched by the Kaggle platform, whose goal was to build a model that could predict which passengers survived the sinking of the Titanic.
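As a first step toward such a model, a common starting point is a baseline that predicts the majority outcome for each group. The sketch below does this for one column. The rows are invented for illustration and are NOT taken from the real dataset, and the lowercase column names are assumptions that may differ from the actual file.

```python
from collections import defaultdict

# Invented rows for illustration only; NOT taken from the real dataset.
ROWS = [
    {"survived": 1, "pclass": 1, "sex": "female", "age": 29.0, "fare": 80.0},
    {"survived": 0, "pclass": 3, "sex": "male", "age": 22.0, "fare": 7.25},
    {"survived": 1, "pclass": 2, "sex": "female", "age": 35.0, "fare": 26.0},
    {"survived": 0, "pclass": 3, "sex": "male", "age": 40.0, "fare": 8.05},
    {"survived": 1, "pclass": 1, "sex": "male", "age": 50.0, "fare": 52.0},
    {"survived": 0, "pclass": 3, "sex": "female", "age": 18.0, "fare": 7.9},
]

def fit_group_baseline(rows, key):
    """Learn the majority outcome (0 or 1) for each value of `key`."""
    counts = defaultdict(lambda: [0, 0])  # value -> [not survived, survived]
    for row in rows:
        counts[row[key]][row["survived"]] += 1
    return {value: int(c[1] > c[0]) for value, c in counts.items()}

def accuracy(model, rows, key):
    """Fraction of rows whose outcome matches the group prediction."""
    hits = sum(model[row[key]] == row["survived"] for row in rows)
    return hits / len(rows)

model = fit_group_baseline(ROWS, "sex")
print(model)                                   # {'female': 1, 'male': 0}
print(round(accuracy(model, ROWS, "sex"), 2))  # 0.67
```

The accuracy here is measured on the same rows the baseline was fitted on, so it is only a sanity check; a fair evaluation would use held-out test data.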
Platforms for Finding More Datasets
If you want to go further and find your own dataset, the best approach is to browse the most popular repositories in the Machine Learning world:
Kaggle
Kaggle, a company owned by Google LLC, is an online community of data scientists and Machine Learning practitioners. It lets users find and publish datasets, explore and build models in a web-based data science environment, work with other data scientists and Machine Learning engineers, and enter competitions to solve data science challenges.
Kaggle started in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education.
Dataset Search
Dataset Search is a search engine from Google that helps researchers locate online data that is freely available for use. Across the web there are millions of datasets on almost any topic that interests you.
If you are looking to buy a puppy, you can find datasets compiling complaints from puppy buyers or studies on puppy cognition. If you like skiing, you can find data on ski resort revenue, injury rates, and participation numbers. Dataset Search has indexed almost 25 million of these datasets, giving you a single place to search for datasets and find links to where the data lives.
UCI Machine Learning Repository
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators used by the Machine Learning community for the empirical analysis of Machine Learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine.
Since then, it has been widely used by students, educators, and researchers around the world as a primary source of ML datasets. As an indication of the archive's impact, it has been cited more than 1000 times, making it one of the top 100 most cited "papers" in all of computer science.
Quandl
Quandl is a platform that provides its users with economic, financial, and alternative datasets. Users can download free data, buy paid data, or sell data to Quandl. It can be a useful tool for developing trading algorithms, for example.
Conclusion
By exploring these tools, you are sure to find good inputs for your projects. Make sure you choose the dataset that best fits your specific needs, and always remember: it is not just about quantity, but also about quality. The dataset is the foundation of any Machine Learning project, and building on quality data is essential to avoid the risk of reaching erroneous conclusions.