Hacker News

Text classification with Python 3.14's ZSTD module


8 min read Via maxhalford.github.io

Mewayz Team

Editorial Team


Python 3.14 adds the compression.zstd module to the standard library, and with it a surprisingly powerful way to classify text without any machine learning. By measuring how well a compressor can squeeze two texts together, you can estimate how similar they are. The technique is called Normalized Compression Distance (NCD), and Zstandard now makes it fast enough for production workloads.

How Does Compression-Based Text Classification Actually Work?

The core idea behind compression-based classification comes from information theory. When a compression algorithm such as Zstandard encounters a block of text, it builds an internal dictionary of the patterns it has seen. If two texts share similar vocabulary, phrasing, and structure, compressing them together produces output only slightly larger than compressing the larger text alone. If they are unrelated, the combined compressed size approaches the sum of the two individual sizes.

This relationship is captured by the Normalized Compression Distance formula: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(x) is the compressed size of text x and C(xy) is the compressed size of the two texts concatenated. An NCD value near 0 means the texts are very similar, while a value near 1 means they share almost no information. Because real compressors add a little framing overhead, values can slightly exceed 1 in practice.
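The formula translates almost directly into code. The following is a minimal sketch rather than a reference implementation: the helper names compressed_size and ncd and the two sample sentences are illustrative, and the zlib fallback is an assumption added only so the snippet still runs on interpreters older than Python 3.14.

```python
try:
    from compression import zstd  # standard library in Python 3.14+

    def compressed_size(data: bytes, level: int = 3) -> int:
        """Return C(data): the length of the Zstandard-compressed bytes."""
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: stand-in compressor for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        """Return C(data) using zlib when compression.zstd is unavailable."""
        return len(zlib.compress(data, min(level, 9)))


def ncd(x: bytes, y: bytes, level: int = 3) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx = compressed_size(x, level)
    cy = compressed_size(y, level)
    cxy = compressed_size(x + y, level)
    return (cxy - min(cx, cy)) / max(cx, cy)


sports = ("The striker scored twice as the home team won the league match, "
          "and the coach praised the defence after the final whistle.")
markets = ("The central bank raised interest rates again while investors sold "
           "shares, and analysts warned that inflation could stay high.")

near = ncd(sports.encode("utf-8"), sports.encode("utf-8"))
far = ncd(sports.encode("utf-8"), markets.encode("utf-8"))
print(f"same text: {near:.3f}  unrelated text: {far:.3f}")
```

Comparing a text with itself should give a distance close to 0, while the unrelated pair should land much closer to 1.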

What makes this approach remarkable is that it needs no training data, no tokenization, no embeddings, and no GPU. The compressor itself acts as a learned model of the text's structure. Research published in the paper "Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors" (2023) showed that gzip-based NCD was competitive with BERT on some benchmarks, which sparked renewed interest in the technique.

Why Is Python 3.14's Zstandard Module a Game-Changer for NCD?

Before Python 3.14, using Zstandard meant installing the third-party python-zstandard package. The new compression.zstd module, introduced through PEP 784, ships directly with CPython. That means zero dependency overhead and a stable, vetted API backed by Meta's battle-tested libzstd. For classification workloads specifically, Zstandard offers several advantages over gzip or bzip2:

  • Speed: Zstandard compresses 3-5 times faster than gzip at comparable ratios, making batch classification of thousands of documents feasible in seconds rather than minutes
  • Tunable compression levels: Levels 1 through 22 let you trade speed for ratio, so you can tune NCD precision to your throughput requirements
  • Dictionary support: Pre-trained Zstandard dictionaries can dramatically improve compression of small texts (under 4KB), which is exactly the size range where NCD accuracy matters most
  • Streaming API: The module supports incremental compression, enabling classification pipelines that process documents without loading entire corpora into memory
  • Standard library stability: No version conflicts and no supply-chain risk; from compression import zstd works on every Python 3.14+ installation
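To get a feel for the speed-versus-ratio trade-off from the list above, a rough sketch like the following compares compressed size and wall-clock time at a few levels. The sample text, the chosen levels, the profile_levels helper, and the zlib fallback (which caps levels at 9) are all assumptions for illustration; actual numbers depend on your data and machine.

```python
import time

try:
    from compression import zstd  # standard library in Python 3.14+

    def compress(data: bytes, level: int) -> bytes:
        return zstd.compress(data, level=level)
except ImportError:
    import zlib  # assumption: fallback for older interpreters, levels capped at 9

    def compress(data: bytes, level: int) -> bytes:
        return zlib.compress(data, min(level, 9))


def profile_levels(data: bytes, levels=(1, 3, 10, 19)) -> dict[int, tuple[int, float]]:
    """Map each level to (compressed size in bytes, seconds taken)."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        size = len(compress(data, level))
        results[level] = (size, time.perf_counter() - start)
    return results


corpus = ("Support ticket: the customer cannot log in after resetting "
          "the password and asks for a refund of the last invoice. ") * 200
report = profile_levels(corpus.encode("utf-8"))
for level, (size, seconds) in report.items():
    print(f"level {level:>2}: {size:>6} bytes in {seconds * 1000:.2f} ms")
```

Timing a single call is noisy, so for a real decision it is worth repeating the measurement on a representative slice of your own documents.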

Key insight: Compression-based classification works best when you need a fast, dependency-free baseline that handles multilingual text natively. Because compressors operate on raw bytes rather than language-specific tokens, they classify Chinese, Arabic, or mixed-language text as readily as English, with no language model required.


What Does an Implementation Look Like?

A minimal NCD classifier in Python 3.14 fits in under 30 lines. You prepare one reference text per category, then for each new document you compute the NCD against every reference text and assign the category with the smallest distance. Here is the core logic:

First, import the module with from compression import zstd. Define a function that accepts two byte strings, compresses each one individually, compresses their concatenation, and returns the NCD score. Then build a dictionary that maps category labels to representative sample texts. For each incoming document, loop over the categories, compute the NCD, and pick the smallest.
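The steps just described can be sketched as follows. This is one possible reading of them, not the original author's code: the REFERENCES texts, the labels, and the csize, ncd, and classify names are hypothetical, and the zlib fallback is an assumption so the example also runs before Python 3.14.

```python
try:
    from compression import zstd  # standard library in Python 3.14+

    def csize(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: fallback for older interpreters

    def csize(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, min(level, 9)))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)


# Hypothetical reference texts; real ones would be drawn from labeled data.
REFERENCES = {
    "sports": (
        "The striker scored a late goal and the home team won the league match. "
        "The coach praised the goalkeeper, the fans cheered in the stadium, and "
        "the club moved to the top of the table before the cup final next season."
    ),
    "business": (
        "The central bank raised interest rates and the stock market fell sharply. "
        "Investors sold shares, the company reported lower quarterly profits, and "
        "analysts warned that inflation would hurt economic growth next year."
    ),
}


def classify(text: str, references: dict[str, str] = REFERENCES) -> str:
    """Return the label whose reference text is closest to `text` under NCD."""
    doc = text.encode("utf-8")
    distances = {
        label: ncd(doc, ref.encode("utf-8")) for label, ref in references.items()
    }
    return min(distances, key=distances.get)


print(classify("The goalkeeper made a save, the striker scored a late goal, "
               "and the fans cheered in the stadium as the club won the cup final."))
```

This version recompresses every reference text on each call. Since C(reference) never changes, caching those sizes once is an easy way to cut the work per document by roughly a third.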


In benchmarks on the AG News dataset (four-class news classification), this approach with Zstandard at compression level 3 reaches roughly 62-65% accuracy, with no training step, no model download, and a classification speed of about 8,000 documents per second on a single CPU core. Raising the compression level to 10 lifts accuracy to about 68%, at the cost of throughput dropping to about 2,500 documents per second. These numbers do not match fine-tuned transformers, but they provide a solid baseline for prototyping, data-labeling triage, or environments where installing ML dependencies is not practical.

How Does NCD Compare to Traditional ML Classification?

The honest answer is that NCD is not a replacement for transformer-based classifiers in high-stakes production systems. Models such as BERT or GPT-based classifiers reach 94%+ accuracy on standard benchmarks. NCD with Zstandard occupies a different niche. It excels in extreme cold-start scenarios where you have fewer than 50 labeled examples per class, a regime in which even fine-tuned models struggle. It needs zero training time, handles any language or encoding without modification, and runs entirely on CPU with negligible memory.

For businesses handling large volumes of incoming text (support tickets, social media mentions, product reviews), a Zstandard NCD classifier can act as a first-pass router that sorts documents in real time before more expensive models refine the results. This two-stage pipeline significantly reduces inference cost while preserving overall accuracy. Platforms that process user-generated content at scale, such as Mewayz's 207-module business OS used by more than 138,000 merchants, benefit from lightweight classification for routing messages, tagging content, and personalizing user experiences without heavy infrastructure.
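One way to picture the two-stage pipeline is a router that accepts the NCD answer only when the best class clearly beats the runner-up and otherwise hands the document to a heavier model. The sketch below rests on several assumptions: the margin heuristic and its min_margin threshold are not from the benchmarks above, expensive_model is a hypothetical callback standing in for whatever second-stage classifier you use, and the zlib fallback exists only so the code runs before Python 3.14.

```python
from typing import Callable

try:
    from compression import zstd  # standard library in Python 3.14+

    def csize(data: bytes) -> int:
        return len(zstd.compress(data, level=3))
except ImportError:
    import zlib  # assumption: fallback for older interpreters

    def csize(data: bytes) -> int:
        return len(zlib.compress(data, 3))


def ncd(x: bytes, y: bytes) -> float:
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)


def route(
    text: str,
    references: dict[str, str],
    expensive_model: Callable[[str], str],
    min_margin: float = 0.05,
) -> tuple[str, str]:
    """Return (label, source); source is "ncd" or "escalated".

    Needs at least two reference classes so a runner-up distance exists.
    """
    doc = text.encode("utf-8")
    ranked = sorted(
        (ncd(doc, ref.encode("utf-8")), label) for label, ref in references.items()
    )
    (best_distance, best_label), (second_distance, _) = ranked[0], ranked[1]
    if second_distance - best_distance >= min_margin:
        return best_label, "ncd"  # clear winner: keep the cheap answer
    return expensive_model(text), "escalated"  # ambiguous: defer to stage two


references = {
    "billing": "The invoice shows a duplicate charge and the customer requests a refund to the card.",
    "access": "The user cannot log in, the password reset email never arrives, and the account is locked.",
}
label, source = route(
    "I was charged twice on my invoice and would like a refund to my card.",
    references,
    expensive_model=lambda text: "needs-review",  # hypothetical second stage
)
print(label, source)
```

The threshold controls how much traffic reaches the second stage, so it is best tuned on a held-out sample of real documents.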

What Are the Limitations and Best Practices?

Compression-based classification has known limitations you should account for. Short texts (under 100 bytes) produce unreliable NCD scores because the compressor does not have enough data to build meaningful patterns. The method is also sensitive to the choice of reference texts: poorly chosen representatives degrade accuracy significantly. And because NCD is a distance metric rather than a probabilistic model, it does not natively produce confidence scores.

For short-text classification, train a Zstandard dictionary on your domain corpus. This single step can improve accuracy by 8-12 percentage points on short texts.
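A rough sketch of that dictionary step is shown below, using zstd.train_dict and the zstd_dict argument of zstd.compress. Everything else is an assumption for illustration: the synthetic tickets from make_samples stand in for a real domain corpus, the 4096-byte dictionary size is arbitrary, and the demo is skipped on interpreters without compression.zstd because dictionaries are specific to Zstandard.

```python
import random

try:
    from compression import zstd  # standard library in Python 3.14+
except ImportError:
    zstd = None  # no fallback here: trained dictionaries are zstd-specific


def make_samples(count: int, seed: int) -> list[bytes]:
    """Generate synthetic support-ticket lines (stand-in for a real corpus)."""
    rng = random.Random(seed)
    issues = ["a failed payment", "a login error", "a missing invoice",
              "a late delivery", "a broken checkout page", "a duplicate charge"]
    products = ["the mobile app", "the web dashboard",
                "the booking widget", "the online store"]
    return [
        f"Ticket {rng.randrange(100000)}: customer reports {rng.choice(issues)} "
        f"in {rng.choice(products)} and asks for help from support.".encode("utf-8")
        for _ in range(count)
    ]


def csize(data: bytes, zstd_dict=None, level: int = 3) -> int:
    """Compressed size, optionally primed with a trained dictionary."""
    return len(zstd.compress(data, level=level, zstd_dict=zstd_dict))


def ncd(x: bytes, y: bytes, zstd_dict=None) -> float:
    """NCD in which every compression call shares the same dictionary."""
    cx, cy = csize(x, zstd_dict), csize(y, zstd_dict)
    return (csize(x + y, zstd_dict) - min(cx, cy)) / max(cx, cy)


if zstd is not None:
    domain_dict = zstd.train_dict(make_samples(2000, seed=0), 4096)
    ticket = make_samples(1, seed=1)[0]
    print("plain:", csize(ticket), "bytes")
    print("with dictionary:", csize(ticket, domain_dict), "bytes")
```

The same dictionary has to be passed to all three compression calls inside ncd; mixing dictionary and plain sizes would make the distances incomparable.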

Frequently Asked Questions

Does compression-based classification work for sentiment analysis?

It can, but with caveats. Sentiment analysis requires detecting subtle tonal differences between texts that are structurally similar. NCD works best for topic classification, where documents in different categories use distinct vocabulary. For sentiment, accuracy typically lands around 55-60%: better than random, but not production-ready on its own. Combining NCD features with a lightweight logistic regression model improves results considerably.

Can I use the compression.zstd module in Python versions before 3.14?

No. The compression.zstd module is new in Python 3.14. On earlier versions, install the python-zstandard package from PyPI (published there as zstandard), which provides equivalent compress() and decompress() functionality. The NCD logic stays the same; only the import statement changes. Once you upgrade to 3.14, you can drop the third-party dependency entirely.
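A small import shim keeps the NCD code identical across versions. The sketch below tries compression.zstd first and then the third-party zstandard package; the final zlib fallback and the load_compressor name are assumptions added for illustration, not something the answer above prescribes.

```python
def load_compressor():
    """Return (backend name, compress function) for the best available backend."""
    try:
        from compression import zstd  # standard library in Python 3.14+
        return "compression.zstd", lambda data, level=3: zstd.compress(data, level=level)
    except ImportError:
        pass
    try:
        import zstandard  # third-party python-zstandard package
        return "zstandard", lambda data, level=3: (
            zstandard.ZstdCompressor(level=level).compress(data)
        )
    except ImportError:
        import zlib  # assumption: last-resort fallback, not Zstandard at all
        return "zlib", lambda data, level=3: zlib.compress(data, min(level, 9))


BACKEND, compress = load_compressor()


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance using whichever backend was loaded."""
    cx, cy = len(compress(x)), len(compress(y))
    return (len(compress(x + y)) - min(cx, cy)) / max(cx, cy)


print("backend:", BACKEND)
print("ncd:", round(ncd(b"invoice overdue reminder " * 20, b"invoice overdue reminder " * 20), 3))
```

Distances computed with different backends are not directly comparable, so reference scores and thresholds should be recomputed after switching.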

How does Zstandard NCD compare with TF-IDF plus cosine similarity?

For multi-class topic classification with balanced datasets, TF-IDF plus cosine similarity generally reaches 75-82% accuracy compared with Zstandard NCD's 62-68%. However, TF-IDF needs a fitted vectorizer, a defined vocabulary, and language-specific stop-word lists. Zstandard NCD needs none of this preprocessing, works across languages out of the box, and classifies new documents in constant time regardless of vocabulary size. For rapid prototyping or multilingual settings, NCD is often the faster path to a working system.

Whether you are building automated content pipelines, routing customer messages, or prototyping classification logic for your digital business, Python 3.14's built-in Zstandard support makes compression-based NCD more accessible than ever. If you are looking for an all-in-one platform to manage your business's content, products, courses, and customer relationships, start building with Mewayz today and put these techniques to work across your workflow.
