Hacker News

Text Classification with Python 3.14's ZSTD Module


6 min read Via maxhalford.github.io

Mewayz Team

Editorial Team


Python 3.14 adds the compression.zstd module to the standard library, and it opens up a surprisingly powerful way to classify text without machine learning models. By measuring how well a compressor can squeeze two texts together, you can estimate how similar they are. The technique is called Normalized Compression Distance (NCD), and Zstandard now makes it fast enough for production work.

How Does Compression-Based Text Classification Actually Work?

The core idea behind compression-based classification is rooted in information theory. When a compression algorithm such as Zstandard encounters a block of text, it builds an internal dictionary of patterns. If two texts share similar vocabulary, syntax, and structure, compressing their concatenation yields a result only slightly larger than compressing the larger text alone. If they are unrelated, the compressed size of the concatenation approaches the sum of the two individual compressed sizes.

This relationship is captured by the Normalized Compression Distance formula: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(x) is the compressed size of text x and C(xy) is the compressed size of the two texts concatenated. An NCD value near 0 means the texts are highly similar, while a value near 1 means they share almost no information.
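As a rough illustration of the formula, here is a minimal sketch of an NCD function. The helper names compressed_size and ncd are hypothetical, the sample texts are made up, and the zlib fallback is an assumption added only so the sketch also runs on interpreters older than 3.14.

```python
try:
    from compression import zstd  # standard library from Python 3.14 on

    def compressed_size(data: bytes, level: int = 3) -> int:
        """Return C(data): the number of bytes after Zstandard compression."""
        return len(zstd.compress(data, level=level))

except ImportError:
    import zlib  # assumption: stand-in compressor for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        """Return C(data) using zlib, whose levels stop at 9."""
        return len(zlib.compress(data, min(level, 9)))


def ncd(x: bytes, y: bytes, level: int = 3) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx = compressed_size(x, level)
    cy = compressed_size(y, level)
    cxy = compressed_size(x + y, level)  # C(xy): the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Made-up sample texts, for illustration only.
sports = b"The striker scored twice and the team won the league match at home."
finance = b"Central banks raised interest rates as inflation hit bond markets."

print(ncd(sports, sports))   # small: a text compared with itself
print(ncd(sports, finance))  # larger: little shared content
```

On texts this short the absolute values are noisy, so only the ordering of the two distances is meaningful here.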

What makes this technique remarkable is that it needs no training data, no tokenization, no embeddings, and no GPU. The compressor itself acts as a learned model of the structure of the text. Research published in papers such as "Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors" (2023) showed that gzip-based NCD was competitive with BERT on some benchmarks, sparking renewed interest in the approach.

Why Is Python 3.14's Zstandard Module a Game-Changer for NCD?

Before Python 3.14, using Zstandard required installing the third-party python-zstandard package. The new compression.zstd module, introduced through PEP 784, ships directly with CPython. That means zero dependency overhead and a guaranteed, stable API backed by Meta's battle-tested libzstd. For classification tasks specifically, Zstandard offers several advantages over gzip or bzip2:

  • Speed: Zstandard compresses 3-5x faster than gzip at comparable ratios, so batch classification over thousands of documents runs in seconds rather than minutes
  • Dictionary support: Pre-trained Zstandard dictionaries can improve compression of small texts (under 4KB), which is exactly the document size range where NCD matters most
  • Streaming API: The module supports incremental compression, enabling classification pipelines that process texts without loading entire corpora into memory
  • Standard library stability: No version conflicts and no supply-chain risk; from compression import zstd works on every Python 3.14+ installation
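To make the dictionary bullet more concrete, the sketch below primes the compressor with each category's reference text and checks which category lets a new document compress smallest. Note the assumptions: it uses a raw-content dictionary (ZstdDict with is_raw=True) rather than a trained one, it is a variation on the idea and not the NCD formula itself, and the helper name size_with_dict, the sample texts, and the zlib preset-dictionary fallback for older interpreters are all illustrative.

```python
try:
    from compression import zstd  # standard library from Python 3.14 on

    def size_with_dict(data: bytes, dict_content: bytes, level: int = 3) -> int:
        """Compressed size of data when the compressor is primed with dict_content."""
        primed = zstd.ZstdDict(dict_content, is_raw=True)  # raw content, not trained
        return len(zstd.compress(data, level=level, zstd_dict=primed))

except ImportError:
    import zlib  # assumption: preset-dictionary stand-in for older interpreters

    def size_with_dict(data: bytes, dict_content: bytes, level: int = 3) -> int:
        """Compressed size of data using zlib's preset dictionary (zdict)."""
        comp = zlib.compressobj(min(level, 9), zdict=dict_content)
        return len(comp.compress(data) + comp.flush())


# Made-up reference texts, one per category.
references = {
    "sports": b"The striker scored twice and the team won the league match at home. "
              b"The coach praised the goalkeeper after the final whistle.",
    "finance": b"Central banks raised interest rates as inflation hit bond markets. "
               b"Investors sold shares and the currency fell against the dollar.",
}

doc = b"The coach praised the striker after the team won the league match."

# The category whose text best "explains" the document yields the smallest output.
sizes = {label: size_with_dict(doc, ref) for label, ref in references.items()}
best = min(sizes, key=sizes.get)
print(sizes, best)
```

Priming the compressor this way avoids recompressing the reference text for every document, which is one reason dictionary support is attractive for short inputs.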

Key insight: Compression-based classification works best when you need a fast, dependency-free baseline that handles multiple languages natively. Because compressors operate on raw bytes rather than language-specific tokens, they classify Chinese, Arabic, or mixed-language documents as readily as English, with no language model required.

What Does a Practical Implementation Look Like?

A minimal NCD classifier in Python 3.14 fits in under 30 lines. You encode a reference text for each category (one per category), then for each new document compute the NCD against every reference and assign the category with the lowest distance. Here is the core idea:

First, import the module with from compression import zstd. Define a function that accepts two byte strings, compresses each one individually, compresses their concatenation, and returns the NCD score. Then build a dictionary that maps category labels to representative texts. For each incoming document, iterate over the categories, compute the NCD, and select the minimum.
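The steps above could be sketched roughly as follows. The names REFERENCES, ncd, and classify, along with the sample texts, are hypothetical, and the zlib fallback is an assumption so the sketch also runs before Python 3.14.

```python
try:
    from compression import zstd  # standard library from Python 3.14 on

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))

except ImportError:
    import zlib  # assumption: stand-in compressor for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, min(level, 9)))


def ncd(x: bytes, y: bytes) -> float:
    """Compress each input alone, then the concatenation, and normalize."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# One made-up representative text per category.
REFERENCES = {
    "sports": b"The striker scored twice and the team won the league match at home. "
              b"The coach praised the goalkeeper after the final whistle.",
    "finance": b"Central banks raised interest rates as inflation hit bond markets. "
               b"Investors sold shares and the currency fell against the dollar.",
}


def classify(text: str) -> str:
    """Return the category whose reference text has the lowest NCD to the input."""
    doc = text.encode("utf-8")
    return min(REFERENCES, key=lambda label: ncd(doc, REFERENCES[label]))


print(classify("The coach praised the striker after the team won the league match."))
```

Real reference texts would be longer than these toy strings; the article's own guidance on reference size appears in the best-practices section.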

In benchmarks against the AG News dataset (four-class news classification), this approach with Zstandard at compression level 3 achieves roughly 62-65% accuracy, with no training step, no model download, and a classification speed of about 8,000 documents per second on a single CPU core. Raising the compression level to 10 pushes accuracy to about 68% at the cost of dropping throughput to about 2,500 documents per second. These numbers do not match fine-tuned transformers, but they provide a solid baseline for prototyping, data labeling triage, or environments where installing an ML dependency is not feasible.
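Throughput depends heavily on hardware and document length, so it is worth measuring on your own data. The following is a small timing sketch, not the benchmark behind the figures above: docs_per_second is a hypothetical helper, the documents are synthetic, and the zlib fallback (capped at level 9) is an assumption for interpreters older than 3.14.

```python
import time

try:
    from compression import zstd  # standard library from Python 3.14 on

    def compressed_size(data: bytes, level: int) -> int:
        return len(zstd.compress(data, level=level))

except ImportError:
    import zlib  # assumption: stand-in compressor; its levels stop at 9

    def compressed_size(data: bytes, level: int) -> int:
        return len(zlib.compress(data, min(level, 9)))


def docs_per_second(docs: list, references: list, level: int) -> float:
    """Time a full NCD pass (every document against every reference)."""
    ref_sizes = [compressed_size(ref, level) for ref in references]  # computed once
    start = time.perf_counter()
    for doc in docs:
        c_doc = compressed_size(doc, level)
        for ref, c_ref in zip(references, ref_sizes):
            c_both = compressed_size(doc + ref, level)
            _ = (c_both - min(c_doc, c_ref)) / max(c_doc, c_ref)
    elapsed = time.perf_counter() - start
    return len(docs) / max(elapsed, 1e-9)


# Synthetic inputs, for illustration only.
references = [
    b"The striker scored twice and the team won the league match at home.",
    b"Central banks raised interest rates as inflation hit bond markets.",
]
docs = [b"The team won match number %d in the league." % i for i in range(500)]

for level in (3, 10):
    print(level, round(docs_per_second(docs, references, level)))
```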

How Does NCD Compare with Traditional ML Classification?

The honest answer is that NCD is not a replacement for transformer-based classifiers in high-stakes production systems. Models such as BERT or GPT-based classifiers achieve 94%+ accuracy on standard benchmarks. However, NCD with Zstandard occupies a distinct niche. It excels in cold-start scenarios where you have fewer than 50 labeled examples per class, a situation in which even fine-tuned models struggle. It requires zero training time, handles any language or encoding without modification, and runs entirely on the CPU with constant memory use.

For businesses handling large volumes of incoming content (support tickets, social media mentions, product reviews), a Zstandard NCD classifier can serve as a first-pass router that groups documents in real time before more expensive models refine the results. This two-stage pipeline cuts inference costs significantly while preserving overall accuracy. Platforms that process user-generated content at scale, such as Mewayz's 207-module business OS used by more than 138,000 businesses, benefit from lightweight classification for message routing, content tagging, and processing user activity without heavy infrastructure.
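One way to picture the two-stage pipeline is a router that accepts the NCD label when the best category wins clearly and hands the document to a heavier model otherwise. This is only a sketch under stated assumptions: the route function, the margin threshold of 0.05, and the fallback callable (standing in for any expensive classifier) are all hypothetical, as is the zlib fallback for older interpreters.

```python
try:
    from compression import zstd  # standard library from Python 3.14 on

    def compressed_size(data: bytes) -> int:
        return len(zstd.compress(data, level=3))

except ImportError:
    import zlib  # assumption: stand-in compressor for older interpreters

    def compressed_size(data: bytes) -> int:
        return len(zlib.compress(data, 3))


def ncd(x: bytes, y: bytes) -> float:
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def route(text: str, references: dict, fallback, margin: float = 0.05) -> str:
    """First pass with NCD; escalate to fallback(text) when the top two
    categories are closer than margin. Needs at least two categories."""
    doc = text.encode("utf-8")
    scored = sorted((ncd(doc, ref), label) for label, ref in references.items())
    (best_distance, best_label), (second_distance, _) = scored[0], scored[1]
    if second_distance - best_distance < margin:
        return fallback(text)  # ambiguous: let the expensive model decide
    return best_label


# Made-up reference texts and a placeholder second stage.
REFERENCES = {
    "sports": b"The striker scored twice and the team won the league match at home.",
    "finance": b"Central banks raised interest rates as inflation hit bond markets.",
}

def expensive_model(text: str) -> str:
    return "escalated"  # placeholder for a real model call

print(route("The team won the league match at home.", REFERENCES, expensive_model))
```

The margin is a tuning knob: a larger value sends more traffic to the second stage, and a smaller one trusts the compressor more often.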

What Are the Limitations and Best Practices?

Compression-based classification has known limitations that you should account for. Short texts (under 100 bytes) produce unreliable NCD scores because the compressor does not have enough data to build meaningful patterns. The technique is also sensitive to the choice of reference texts: poorly chosen representatives degrade accuracy significantly. And because NCD is a distance measure rather than a probabilistic model, it does not produce confidence scores.

To get the most out of this approach: use reference texts of around 500 bytes per category, try concatenating multiple examples per class (2-3 representative documents joined together yield better dictionaries), normalize case and whitespace before compressing, and compare Zstandard compression levels 3, 6, and 10 to find the best trade-off between speed and accuracy. For very short texts, consider training a Zstandard dictionary on your domain corpus; this single step can improve accuracy by 8-12 percentage points on short documents.
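The normalization and concatenation advice might look like the sketch below. The function names normalize and build_reference are hypothetical, and lowercasing plus collapsing whitespace is just one reasonable reading of "normalize case and whitespace".

```python
import re


def normalize(text: str) -> bytes:
    """Lowercase, collapse runs of whitespace, and encode as UTF-8."""
    return re.sub(r"\s+", " ", text.strip().lower()).encode("utf-8")


def build_reference(examples: list) -> bytes:
    """Join a few representative documents into one reference blob."""
    return b" ".join(normalize(example) for example in examples)


print(normalize("  Hello   WORLD\n"))  # b'hello world'

reference = build_reference([
    "The striker scored twice.",
    "The COACH   praised the goalkeeper.",
])
print(reference)
```

Applying the same normalize function to incoming documents keeps both sides of the comparison consistent.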

Frequently Asked Questions

Does compression-based classification work for sentiment analysis?

It can, but with caveats. Sentiment analysis requires detecting subtle differences in tone within otherwise similar texts. NCD works better for topic classification, where texts in different categories use distinct vocabulary. For sentiment, accuracy typically hovers around 55-60%: better than random, but not production-ready on its own. Combining NCD features with a simple logistic regression model improves results substantially.

Can I use the compression.zstd module in Python versions before 3.14?

No. The compression.zstd module is new in Python 3.14. For earlier versions, install the python-zstandard package from PyPI (published there as zstandard), which provides similar compress() and decompress() functions. The NCD logic stays the same; only the import statement changes. Once you upgrade to 3.14, you can drop the third-party dependency entirely.

How does Zstandard NCD perform compared with TF-IDF and cosine similarity?

On multi-class topic classification with balanced datasets, TF-IDF with cosine similarity typically achieves 75-82% accuracy, compared with 62-68% for Zstandard NCD. However, TF-IDF requires a fitted vectorizer, a defined vocabulary, and language-specific stop-word lists. Zstandard NCD needs none of this preprocessing, works across languages out of the box, and classifies new texts in constant time regardless of vocabulary size. For rapid prototyping or multilingual environments, NCD is often the faster path to a working system.

Whether you are building automated content pipelines, routing customer messages, or prototyping classification logic for your digital business, Python 3.14's built-in Zstandard support makes compression-based NCD more accessible than ever. If you are looking for an all-in-one platform to manage your business content, products, courses, and customer engagement, start building with Mewayz today and put these techniques to work.