Hacker News

Text Classification with Python 3.14's ZSTD Module


10 min read Via maxhalford.github.io

Mewayz Team

Editorial Team

Python 3.14 adds the compression.zstd module to the standard library, and it opens up a surprisingly powerful approach to text classification without machine learning models. By measuring how well a compressor can squeeze two texts together, you can estimate their similarity, a technique known as Normalized Compression Distance (NCD), and Zstandard now makes it fast enough for production workloads.

How Does Compression-Based Text Classification Actually Work?

The core idea behind compression-based classification comes from information theory. When a compression algorithm such as Zstandard processes a block of text, it builds an internal dictionary of recurring patterns. If two texts share vocabulary, phrasing, and structure, compressing them together yields a result only slightly larger than compressing the larger text alone. If they are unrelated, the size of the compressed concatenation approaches the sum of the two individual compressed sizes.

This relationship is captured by the Normalized Compression Distance: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(x) is the compressed size of text x and C(xy) is the compressed size of the two texts concatenated. An NCD value close to 0 means the texts are very similar, while a value close to 1 means they share almost no information content.
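To make the formula concrete, here is a small worked example. The byte counts are made up for illustration and are not measurements from any compressor; they only show how the two extremes of the score arise.

```latex
\[
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}
\]
% Hypothetical related texts: C(x) = 400, C(y) = 500, C(xy) = 600 bytes
\[
\mathrm{NCD}(x, y) = \frac{600 - 400}{500} = 0.4
\]
% Hypothetical unrelated texts: C(xy) \approx C(x) + C(y) = 900 bytes
\[
\mathrm{NCD}(x, y) \approx \frac{900 - 400}{500} = 1.0
\]
```

In the unrelated case the numerator reduces to max(C(x), C(y)), which is why the score approaches 1.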

What makes this technique remarkable is that it needs no training step, no tokenization, no feature engineering, and no GPU. The compressor itself acts as a model of the text's structure. Research such as the paper "'Low-Resource' Text Classification: A Parameter-Free Classification Method with Compressors" (2023) showed gzip-based NCD to be competitive with BERT on some benchmarks, sparking renewed interest in the approach.

Why Is Python 3.14's Zstandard Module a Game Changer for NCD?

Before Python 3.14, using Zstandard meant installing a third-party package such as zstandard (the python-zstandard project). The new compression.zstd module, introduced by PEP 784, ships with CPython itself. That means no extra dependency to install and a stable, supported API backed by Meta's battle-tested libzstd. For classification work specifically, Zstandard offers several advantages over gzip or bzip2:

  • Speed: Zstandard compresses roughly 3-5x faster than gzip at comparable ratios, making batch classification over thousands of documents feasible in seconds rather than minutes
  • Tunable compression levels: Levels 1 through 22 let you trade speed for ratio, so you can balance NCD accuracy against throughput requirements
  • Dictionary support: Pre-trained Zstandard dictionaries can markedly improve compression of small texts (under 4KB), which is exactly the document size range where NCD accuracy matters most
  • Streaming API: The module supports incremental compression, enabling classification pipelines that process text without loading the entire corpus into memory
  • Standard library stability: No version conflicts and no supply-chain risk; from compression import zstd works on standard Python 3.14+ installations
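The streaming point can be sketched as follows. This is a minimal illustration, not a reference implementation: streamed_size and new_compressor are hypothetical helper names, and the zlib fallback is an assumption added only so the sketch also runs on interpreters older than 3.14.

```python
import zlib

try:
    # Python 3.14+: Zstandard from the standard library.
    from compression import zstd

    def new_compressor(level=3):
        return zstd.ZstdCompressor(level=level)
except ImportError:
    # Assumption of this sketch: fall back to zlib, whose compressobj
    # exposes the same compress()/flush() pair.
    def new_compressor(level=3):
        return zlib.compressobj(level)


def streamed_size(chunks, level=3):
    """Return the compressed size of an iterable of byte chunks."""
    compressor = new_compressor(level)
    total = 0
    for chunk in chunks:
        # Feed one chunk at a time; only the output length is kept.
        total += len(compressor.compress(chunk))
    # Flush whatever the compressor is still buffering.
    total += len(compressor.flush())
    return total
```

Because only the running total is kept, the input can come from a generator that reads a large file line by line, and the level argument is where the speed-versus-ratio trade-off is set.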

Key takeaway: Compression-based classification works best when you need a fast, dependency-free baseline that handles multilingual text natively. Because compressors operate on raw bytes rather than language-specific tokens, they classify Chinese, Arabic, or mixed-script documents just as they do English, with no language model required.

What Does a Working Implementation Look Like?

A minimal NCD classifier in Python 3.14 fits in under 30 lines. You encode each reference text (one per category) as bytes; then, for each new document, you compute the NCD against every reference and assign the category with the smallest distance. The core logic is as follows:

First, import the module with from compression import zstd. Define a function that takes two byte strings, compresses each one, compresses their concatenation, and returns the NCD score. Then build a dictionary mapping category labels to representative sample texts. For each incoming document, encode it as bytes, compute the NCD against each category, and pick the minimum.
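The steps just described can be sketched roughly as below. The function names, the two reference texts, and the zlib fallback are assumptions made for illustration (the fallback only exists so the sketch runs before Python 3.14); the original article's exact code is not reproduced here.

```python
try:
    from compression import zstd  # standard library in Python 3.14+

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: fallback so the sketch runs on older Pythons

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, level))


def ncd(x: bytes, y: bytes, level: int = 3) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x, level), compressed_size(y, level)
    cxy = compressed_size(x + y, level)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(text: str, references: dict[str, bytes], level: int = 3) -> str:
    """Return the label whose reference text is closest to the document."""
    doc = text.encode("utf-8")
    return min(references, key=lambda label: ncd(doc, references[label], level))


# Hypothetical reference texts, one per category.
REFERENCES = {
    "sports": b"The team won the match after the striker scored a late goal. "
              b"The coach praised the players and the goalkeeper after the league game.",
    "finance": b"The bank raised interest rates as inflation rose. Investors sold "
               b"shares and the stock market fell while bond yields climbed higher.",
}

label = classify("the striker scored a late goal and the coach praised the players",
                 REFERENCES)
```

References this short are only good for a demonstration; realistic use would want longer, more representative texts per category.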

In a benchmark on the AG News dataset (four-class news classification), this approach using Zstandard at compression level 3 achieves roughly 62-65% accuracy, with no training step, no model download, and a classification speed of about 8,000 documents per second on a single CPU core. Raising the compression level to 10 pushes accuracy to about 68% at the cost of reducing throughput to around 2,500 documents per second. These numbers do not match fine-tuned transformers, but they provide a solid baseline for prototyping, data labeling, or environments where installing ML dependencies is impractical.

How Does NCD Compare with Traditional ML Classification?

The honest answer is that NCD does not replace transformer-based classifiers in high-stakes production systems. Models such as BERT or GPT-based classifiers reach 94%+ accuracy on standard benchmarks. However, NCD with Zstandard occupies a distinct niche. It excels in cold-start scenarios where you have fewer than 50 labeled examples per class, a situation in which even fine-tuned models struggle. It requires zero training time, handles any language or encoding without modification, and runs entirely on CPU with constant memory.

For businesses handling large volumes of incoming content (support tickets, social media mentions, product reviews), a Zstandard NCD classifier can serve as a first-pass router that sorts documents in real time before more expensive models refine the results. This two-stage pipeline can cut inference costs considerably while preserving overall accuracy. Platforms that handle user-generated content at scale, such as Mewayz's 207-module business OS used by more than 138,000 entrepreneurs, benefit from lightweight classification for routing messages, tagging content, and personalizing the user experience without heavy infrastructure.
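One way such a first-pass router might decide when to hand a document to the more expensive model is sketched below. The margin rule and the route function are assumptions for illustration, not something the article specifies: the idea is to accept the NCD label only when the best category clearly beats the runner-up.

```python
def route(distances: dict[str, float], margin: float = 0.05):
    """Return a label if the NCD result is decisive, else None to escalate.

    `distances` maps each category label to the document's NCD score and
    must contain at least one entry. The default margin is arbitrary.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    if len(ranked) == 1:
        return ranked[0][0]
    (best_label, best), (_, runner_up) = ranked[0], ranked[1]
    if runner_up - best >= margin:
        return best_label
    return None  # ambiguous: hand off to the more expensive model
```

For example, route({"billing": 0.61, "bug": 0.83}) accepts "billing", while route({"billing": 0.70, "bug": 0.72}) returns None so the second stage can decide. The margin would need tuning on your own data.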

What Are the Limitations and Best Practices?

Compression-based classification has well-known limitations that you should account for. Short texts (under 100 bytes) produce unreliable NCD scores because the compressor does not have enough data to build meaningful patterns. The technique is also sensitive to the choice of reference texts: poor representatives degrade accuracy significantly. And because NCD is a distance measure rather than a probabilistic model, it does not naturally produce confidence scores.

To get the most out of this approach: use reference texts of at least 500 bytes per category, experiment with concatenating several examples per class (2-3 representative documents joined together give the compressor a richer dictionary), normalize text casing and whitespace before compression, and benchmark across Zstandard compression levels 3, 6, and 10 to find your speed-accuracy balance. For short-text classification, pre-train a Zstandard dictionary on your domain corpus; this single step can improve accuracy by 8-12 points on short documents.
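Two of these practices, normalizing the text and joining several examples per class, can be sketched as below. The helper names normalize and build_reference are hypothetical, and case-folding plus whitespace collapsing is just one reasonable reading of "normalize".

```python
def normalize(text: str) -> bytes:
    """Case-fold, collapse whitespace runs, and encode as UTF-8."""
    return " ".join(text.casefold().split()).encode("utf-8")


def build_reference(examples: list[str]) -> bytes:
    """Join a few representative documents into one reference text."""
    return b"\n".join(normalize(example) for example in examples)
```

The same normalize function should be applied to incoming documents, so that references and queries are compressed in a consistent form.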

Frequently Asked Questions

Does compression-based classification work for sentiment analysis?

It can, but with caveats. Sentiment analysis requires detecting subtle differences in tone among texts that otherwise look alike. NCD works best for topic classification, where documents in different categories use distinct vocabulary. For sentiment, accuracy typically lands around 55-60%: better than random, but not production-ready on its own. Combining NCD features with a lightweight logistic regression model improves results considerably.

Can I use the compression.zstd module in Python versions before 3.14?

No. The compression.zstd module is new in Python 3.14. For earlier versions, install the zstandard package (the python-zstandard project) from PyPI, which provides equivalent compress() and decompress() functions. The NCD logic stays the same; only the import statement changes. Once you upgrade to 3.14, you can drop the third-party dependency entirely.
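A version-tolerant import might look like the sketch below. load_zstd is a hypothetical helper, and it assumes the third-party fallback is the zstandard package from PyPI; it returns None when neither module is available rather than raising.

```python
def load_zstd():
    """Return a module exposing compress()/decompress(), or None."""
    try:
        from compression import zstd  # standard library, Python 3.14+
        return zstd
    except ImportError:
        pass
    try:
        import zstandard  # third-party package from PyPI
        return zstandard
    except ImportError:
        return None


zstd = load_zstd()
if zstd is not None:
    payload = b"compression-based classification " * 20
    # Both modules accept a `level` keyword on compress().
    restored = zstd.decompress(zstd.compress(payload, level=3))
```

The rest of an NCD classifier can then call zstd.compress() without caring which implementation was loaded.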

How does Zstandard NCD perform compared with TF-IDF plus cosine similarity?

On multi-class topic classification with balanced datasets, TF-IDF with cosine similarity typically reaches 75-82% accuracy, compared with 62-68% for Zstandard NCD. However, TF-IDF requires a fitted vectorizer, a fixed vocabulary, and language-specific stop-word lists. Zstandard NCD needs none of that preprocessing, works across languages out of the box, and classifies new documents in time that does not depend on vocabulary size. For rapid prototyping or multilingual environments, NCD is often the fastest path to a working system.

Whether you are building an automated content pipeline, routing customer messages, or prototyping for your digital business, Python 3.14's Zstandard support makes compression-based NCD more accessible than ever. If you are looking for an all-in-one platform to manage your business content, products, courses, and customer interactions, start building with Mewayz today and put these techniques to work across your operations.