Hacker News

Magulu a zolemba ndi gawo la Python 3.14's ZSTD

Magulu a zolemba ndi gawo la Python 3.14's ZSTD Kusanthula kwatsatanetsatane kwalembaku kumapereka kuwunika kwatsatanetsatane kwazigawo zake zazikulu komanso tanthauzo lalikulu. Magawo Ofunika Kwambiri Kukambitsirana kwakhazikika pa: Njira zazikuluzikulu ndi pro...

7 min read Via maxhalford.github.io

Mewayz Team

Editorial Team

Hacker News
Tsopano ndili ndi zonse zomwe ndikufuna. Ndiroleni ndilembe positi yabulogu.

Magulu Olemba ndi Python 3.14's ZSTD Module

Python 3.14 imayambitsa gawo la compression.zstd ku laibulale yokhazikika, ndipo imatsegula njira yamphamvu yodabwitsa yopangira malemba popanda zitsanzo zamakina ophunzirira. Poyesa momwe kompresa ingaphatikizire malemba awiri palimodzi, mukhoza kudziwa kufanana kwake - njira yotchedwa Normalized Compression Distance (NCD) - ndipo tsopano Zstandard imapangitsa kuti ikhale yothamanga mokwanira kuti ikhale yochuluka.

Kodi Compression-based Text Classification Imagwira Ntchito Motani?

Lingaliro lofunika kwambiri la kuphatikizika kozikidwa pagulu lakhazikika mu chiphunzitso cha chidziwitso. Pamene compression algorithm ngati Zstandard ikumana ndi zolemba zambiri, imapanga mtanthauzira wamkati wamapangidwe. Ngati malemba awiri ali ndi mawu ofanana, kalembedwe kake, ndi kamangidwe kake, kuwaphatikizira pamodzi kumatulutsa zotsatira zazikulu pang'ono kusiyana ndi kufinya mawu aakulu okha. Ngati ndi zosagwirizana, kukula kophatikizikako kumayandikira kuchuluka kwa makulidwe onse awiriwo.

Ubalewu wajambulidwa ndi njira ya Normalized Compression Distance: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), pomwe C(x) ndi kukula kopanikizidwa kwa mawu x, ndipo C(xy) ndi kukula kopanikizidwa kwa zolemba ziwiri zolumikizidwa. Mtengo wa NCD pafupi ndi 0 umatanthawuza kuti malembawo ndi ofanana kwambiri, pamene mtengo wapafupi ndi 1 umatanthauza kuti sagawana zambiri.

Chomwe chimapangitsa kuti njirayi ikhale yodabwitsa ndikuti sichifuna chidziwitso cha maphunziro, palibe zizindikiro, palibe zoyikapo, komanso palibe GPU. Compressor palokha imakhala ngati chitsanzo chophunzirira cha kapangidwe kake. Kafukufuku wofalitsidwa m'mapepala ngati "Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors" (2023) adawonetsa kuti Gzip-based NCD NCD inapikisana ndi BERT pamabenchmarks ena, zomwe zidayambitsa chidwi chatsopano panjirayo.

Chifukwa chiyani Python 3.14's Zstandard Module ndi Game-Changer ya NCD?

Pamaso pa Python 3.14, kugwiritsa ntchito Zstandard kumafunika kukhazikitsa gulu lachitatu python-zstandard phukusi. The new compression.zstd module, yoyambitsidwa kudzera pa PEP 784, imatumiza mwachindunji ndi CPython. Izi zikutanthauza kudalira zero pamwamba ndi API yotsimikizika, yokhazikika yothandizidwa ndi libzstd yoyesedwa ndi Meta. Pazosankha zamagulu makamaka, Zstandard imapereka maubwino angapo kuposa gzip kapena bzip2:

  • Liwiro: Zstandard imapanikiza 3-5x mwachangu kuposa gzip mofananiza, kupangitsa kuti ma batch agawike pa masauzande a zolemba kuti azitha kugwira ntchito masekondi osati mphindi
  • Tunable milingo yoponderezedwa: Magawo 1 mpaka 22 amakulolani kusinthanitsa liwiro la chiŵerengero, kukulolani kuti muyese molondola za NCD motsutsana ndi zomwe mukufunikira
  • Chithandizo cha mtanthauzira mawu: Madikishonale ophunzitsidwa kale a Zstandard atha kuwongolera kwambiri kalembedwe kazolemba ting'onoting'ono (pansi pa 4KB), womwe uli ndendende kukula kwa zolemba komwe kulondola kwa NCD kumafunikira kwambiri
  • Streaming API: Gawoli limathandizira kukanikiza kochulukira, kupangitsa mapaipi a magulu omwe amasanthula zolemba popanda kuyika gulu lonse mukumbukiro
  • Kukhazikika kwa laibulale yokhazikika: Palibe mikangano yamitundu, palibe chiwopsezo chautundu — kuchokera ku compression import zstd imagwira ntchito pakuyika kulikonse kwa Python 3.14+

Chidziwitso chofunikira kwambiri: Kuphatikizika kotengera kutsata kumagwira ntchito bwino mukafuna njira yofulumira, yopanda kudalira yomwe imagwira mawu azilankhulo zambiri m'menemo. Chifukwa makina osindikizira amagwira ntchito pa ma byte aiwisi m'malo mwa zizindikiro za chinenero, amaika m'magulu a Chitchaina, Chiarabu, kapena zilankhulo zosakanikirana mofanana ndi Chingerezi - palibe chinenero chomwe chimafunikira.

Kodi Kugwiritsa Ntchito Mothandiza Kumawoneka Motani?

Kagulu kakang'ono ka NCD mu Python 3.14 kumakhala pansi pa mizere 30. Mumasindikiza malemba aliwonse (amodzi pagulu), kenako pa chikalata chatsopano chilichonse, phatikizani NCD motsutsana ndi chilichonse ndikugawa gawo lomwe lili ndi mtunda wotsika kwambiri. Nayi mfundo yayikulu:

Choyamba, lowetsani gawoli ndi kuchokera ku compression import zstd. Tanthauzirani ntchito yomwe imavomereza zingwe ziwiri za byte, kukakamiza aliyense payekhapayekha, kukakamiza kulumikizana kwawo, ndikubweza zigoli za NCD. Kenako pangani malembo a gulu la mtanthauzira mawu kuti akhale zitsanzo za malemba. Pachikalata chilichonse chomwe chikubwera, bwerezani magawo, phatikizani NCD, ndikusankha zochepa.

💡 DID YOU KNOW?

Mewayz replaces 8+ business tools in one platform

CRM · Invoicing · HR · Projects · Booking · eCommerce · POS · Analytics. Free forever plan available.

Start Free →

Poyerekeza ndi dataseti ya AG News (gulu lankhani zamagulu anayi), njira iyi yogwiritsira ntchito Zstandard pa compression level 3 imakwaniritsa pafupifupi 62-65% kulondola - palibe sitepe yophunzitsira, palibe kutsitsa kwachitsanzo, ndi liwiro la magawo pafupifupi 8,000 sekondi imodzi pa core CPU imodzi. Kukweza mulingo wa psinjika mpaka 10 kumakankhira kulondola kwa pafupifupi 68% pamtengo wochepetsera kutulutsa kwa zolemba pafupifupi 2,500 pamphindikati. Manambalawa samafanana ndi ma transfoma ochunidwa bwino, koma amapereka maziko olimba a ma prototyping, kuchuluka kwa zilembo za data, kapena malo omwe kuyikira kudalira kwa ML sikungatheke.

Kodi NCD Imafananiza Bwanji ndi Magulu Achikhalidwe a ML?

Yankho loona mtima ndiloti NCD siilowa m'malo mwa magulu a transformer-based classifiers mu machitidwe apamwamba kwambiri. Mitundu ngati BERT kapena GPT-based classifiers imakwaniritsa 94%+ yolondola pama benchmarks wamba. Komabe, NCD yokhala ndi Zstandard imakhala ndi niche yapadera. Imapambana muzochitika zozizira pomwe muli ndi zitsanzo zosakwana 50 zolembedwa pagulu lililonse - zomwe zimavutira ngakhale mitundu yosinthidwa bwino. Pamafunika ziro nthawi yophunzitsira, imagwira chilankhulo chilichonse kapena kukopera popanda kusinthidwa, ndipo imayenda pa CPU ndi kukumbukira kosalekeza.

Kwa mabizinesi omwe amayang'anira kuchuluka kwazinthu zomwe zikubwera - matikiti othandizira, zonena zapa media media, ndemanga zazinthu - gulu la Zstandard NCD litha kukhala ngati rauta yoyamba yomwe imayika zolemba munthawi yeniyeni zitsanzo zodula zisanakonze zotsatira. Mapaipi awiriwa amachepetsa mtengo wololera kwambiri ndikusunga kulondola kwathunthu. Mapulatifomu okonza zinthu zopangidwa ndi ogwiritsa ntchito pamlingo waukulu, monga Mewayz's 207-module business OS yogwiritsidwa ntchito ndi amalonda opitilira 138,000, amapindula ndi kusanjika kopepuka kupita ku mauthenga apanjira, zomwe zili ndi ma tag, ndikusintha zomwe ogwiritsa ntchito akumana nazo popanda zida zolemetsa.

Kodi Zolepheretsa ndi Zochita Zabwino Ndi Chiyani?

Magulu otengera kuponderezedwa adziwa malire omwe muyenera kuwawerengera. Zolemba zazifupi (pansi pa 100 byte) zimatulutsa ma NCD osadalirika chifukwa kompresa ilibe deta yokwanira kuti ipange mapangidwe atanthauzo. Njirayi imakhudzidwanso ndi kusankha kwa malemba - oimira osasankhidwa bwino amanyoza kulondola kwambiri. Ndipo chifukwa NCD ndi mtunda wa metric m'malo mongoyerekeza, sizimatulutsa zigoli zodalirika.

Kuti mupindule kwambiri ndi njira iyi: gwiritsani ntchito zolemba zosachepera 500 mabayiti pagulu lililonse, yesani kuphatikiza zitsanzo zingapo pagulu lililonse (zolemba zoyimira 2-3 zolumikizidwa pamodzi zimapereka mawu otanthauzira bwino), sinthani kalembedwe kake ndi whitespace musanakanikize, ndi benchmark kudutsa Zstandard compression milingo 3, 6, ndi 10 kuti mupeze kulondola kwa liwiro lanu. Pamagawo ang'onoang'ono, phunzitsanitu mtanthauzira mawu wa Zstandard pa domeni yanu corpus - sitepe imodzi iyi ikhoza kuwongolera kulondola ndi maperesenti 8-12 pazolemba zazifupi.

Mafunso Ofunsidwa Kawirikawiri

Kodi kuyika motengera kukakamiza kumagwira ntchito pakuwunika malingaliro?

Itha, koma ndi chenjezo. Kusanthula kwamaganizidwe kumafuna kuzindikira kusiyana kobisika kwa matawulidwe m'mawu ofanana. NCD imagwira ntchito bwino pakugawa mitu komwe zolemba m'magulu osiyanasiyana zimagwiritsa ntchito mawu osiyana. Pamalingaliro, kulondola nthawi zambiri kumafika pafupifupi 55-60% - bwino kuposa mwachisawawa, koma osakonzekera zokha. Kuphatikiza mawonekedwe a NCD ndi njira yopepuka yosinthira kumapangitsa zotsatira zabwino kwambiri.

Kodi ndingagwiritse ntchito gawo la compression.zstd mumitundu ya Python isanakwane 3.14?

Ayi. Gawo la compression.zstd ndi latsopano mu Python 3.14. Pamitundu yakale, ikani python-zstandard phukusi kuchokera ku PyPI, yomwe imapereka ntchito zofanana compress() ndi decompress(). Lingaliro la NCD limakhalabe lofanana - mawu okhawo omwe amalowetsamo amasintha. Mukakweza mpaka 3.14, mutha kusiya kudalira wina aliyense.

Kodi Zstandard NCD imagwira ntchito bwanji poyerekeza ndi TF-IDF yofanana ndi cosine?

Pamagulu amitu yambiri okhala ndi dataset yokhazikika, TF-IDF kuphatikiza kufanana kwa cosine nthawi zambiri kumakwaniritsa kulondola kwa 75-82% poyerekeza ndi Zstandard NCD's 62-68%. Komabe, TF-IDF imafuna vectoriser yolumikizidwa, mawu odziwika bwino, ndi mindandanda ya mawu oyimitsa a chilankhulo china. Zstandard NCD sichifuna kukonzanso izi, imagwira ntchito m'zilankhulo zambiri, ndikuyika zolemba zatsopano nthawi zonse mosasamala kanthu za kukula kwa mawu. Pazinthu zongoyerekeza kapena zilankhulo zambiri, NCD nthawi zambiri imakhala njira yofulumira yopita kumalo ogwirira ntchito.

Kaya mukupanga mapaipi ongopanga okha, kutumiza mauthenga kwamakasitomala, kapena malingaliro amtundu wabizinesi yanu ya digito, chithandizo cha Python 3.14 chopangidwa ndi Zstandard chimapangitsa kuti NCD yozikidwa pa compression ipezeke kuposa kale. Ngati mukuyang'ana nsanja zonse-inmodzi kuti muzitha kuyang'anira bizinesi yanu, malonda, maphunziro, ndi kuyankhulana ndi makasitomala, yambani kumanga ndi Mewayz lero ndipo tsatirani njirazi kuti zigwire ntchito yanu yonse.