Hacker News

Text classification with the Python 3.14 ZSTD module

11 min read Via maxhalford.github.io

Mewayz Team

Editorial Team

Text Classification with the Python 3.14 ZSTD Module

Python 3.14 adds the compression.zstd module to the standard library, and with it a surprisingly effective way to classify text without any machine learning model. By measuring how well a compressor squeezes two texts together, you can estimate how similar they are. The technique is known as Normalized Compression Distance (NCD), and Zstandard now makes it fast enough for practical work.

How Does Compression-Based Text Classification Actually Work?

The core idea behind compression-based classification comes from information theory. When a compression algorithm such as Zstandard encounters a block of text, it builds an internal dictionary of recurring patterns. If two texts share vocabulary, phrasing, and structure, compressing their concatenation produces a result only slightly larger than compressing the larger text alone. If they are unrelated, the compressed size of the concatenation approaches the sum of the two individual compressed sizes.

The Normalized Compression Distance formula captures this relationship: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(x) is the compressed size of text x and C(xy) is the compressed size of the two texts concatenated. An NCD value near 0 means the texts are highly similar, while a value near 1 means they share almost no information.
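To make the formula concrete, here is a minimal sketch that plugs compressed sizes into it for one related pair and one unrelated pair of texts. The sample sentences and the helper names compressed_size and ncd_from_sizes are illustrative assumptions, and the zlib fallback is only there so the sketch also runs on interpreters older than 3.14.

```python
try:
    from compression import zstd  # standard library from Python 3.14

    def compressed_size(data: bytes) -> int:
        return len(zstd.compress(data))
except ImportError:
    import zlib  # assumption: stand-in compressor for older interpreters

    def compressed_size(data: bytes) -> int:
        return len(zlib.compress(data))


# Illustrative sample texts (not taken from any benchmark).
x = (b"The striker scored twice in the second half as the home team won the cup final. "
     b"The goalkeeper saved a late penalty and the manager praised the defence after the match. "
     b"Thousands of supporters celebrated in the stadium long after the final whistle.")
y = (b"The goalkeeper saved a late penalty and the home team won the cup final. "
     b"The striker scored twice in the second half and the manager praised the defence. "
     b"Supporters celebrated in the stadium long after the final whistle.")
z = (b"Central banks raised interest rates again to slow persistent inflation. "
     b"Bond yields climbed while equity markets fell, and analysts warned that "
     b"borrowing costs for households and small businesses would keep rising this year.")


def ncd_from_sizes(a: bytes, b: bytes) -> float:
    c_a = compressed_size(a)
    c_b = compressed_size(b)
    c_ab = compressed_size(a + b)
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    return (c_ab - min(c_a, c_b)) / max(c_a, c_b)


related = ncd_from_sizes(x, y)    # two texts about the same match
unrelated = ncd_from_sizes(x, z)  # sports text against finance text
print(f"related pair:   {related:.3f}")
print(f"unrelated pair: {unrelated:.3f}")
```

The exact values depend on the compressor and its version, so only the ordering of the two numbers is meaningful here.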

What makes this approach compelling is that it needs no model training, no tokenization, no embeddings, and no GPU. The compressor itself acts as a learned model of the text's structure. Research published in papers such as "Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors" (2023) showed that gzip-based NCD is competitive with BERT on some benchmarks, which sparked renewed interest in the method.

Why Is the Python 3.14 Zstandard Module a Game-Changer for NCD?

Before Python 3.14, using Zstandard meant installing a third-party package such as python-zstandard. The new compression.zstd module, introduced through PEP 784, ships with CPython itself. That means no extra dependency and a stable API backed by Meta's battle-tested libzstd. For classification work specifically, Zstandard offers several advantages over gzip or bzip2:

  • Speed: Zstandard compresses 3-5x faster than gzip at comparable ratios, which makes batch classification over thousands of documents feasible in seconds rather than minutes
  • Tunable compression levels: Levels 1 through 22 let you trade speed for ratio, so you can calibrate NCD precision against throughput requirements
  • Dictionary support: Pre-trained Zstandard dictionaries can markedly improve compression of small texts (under 4KB), which is exactly the document size where NCD accuracy matters most
  • Streaming API: The module supports incremental compression, enabling classification pipelines that process data without loading entire corpora into memory
  • Standard library stability: No version conflicts and no supply chain risk; from compression import zstd works on any standard Python 3.14+ installation
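The streaming point in the list above can be sketched as follows: the compressed size of a long concatenation is accumulated chunk by chunk, so neither the input nor the compressed output has to sit in memory at once. The helper names new_compressor and streamed_compressed_size are hypothetical, and the zlib fallback is an assumption added so the sketch also runs before 3.14.

```python
from typing import Iterable

try:
    from compression import zstd  # standard library from Python 3.14

    def new_compressor(level: int = 3):
        return zstd.ZstdCompressor(level=level)
except ImportError:
    import zlib  # assumption: stand-in for older interpreters

    def new_compressor(level: int = 3):
        # zlib only supports levels up to 9, so higher levels are clamped.
        return zlib.compressobj(min(level, 9))


def streamed_compressed_size(chunks: Iterable[bytes], level: int = 3) -> int:
    """Total compressed size of all chunks, fed to the compressor one at a time."""
    compressor = new_compressor(level)
    total = 0
    for chunk in chunks:
        # compress() may return an empty bytes object while data is buffered.
        total += len(compressor.compress(chunk))
    # flush() emits whatever is still buffered and finishes the stream.
    total += len(compressor.flush())
    return total


# Hypothetical usage: a generator stands in for lines read lazily from a large file.
line = b"ticket: customer cannot log in after password reset\n"
print(streamed_compressed_size(line for _ in range(1000)))
```

Because only byte counts are kept, the same function works for a reference corpus followed by a new document, which is the C(xy) term NCD needs.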

Key insight: Compression-based classification works best when you need a fast, dependency-free baseline that handles multilingual text natively. Because compressors operate on raw bytes rather than language-specific tokens, they classify Chinese, Arabic, or mixed-language documents as readily as English, with no language model required.

What Does a Practical Implementation Look Like?

A minimal NCD classifier in Python 3.14 fits in under 30 lines. You encode one reference text per category, then for each new document compute the NCD against every reference and assign the category with the smallest distance. The core logic is as follows:

First, import the module with from compression import zstd. Define a function that takes two byte strings, compresses each one individually, compresses their concatenation, and returns the NCD score. Then build a dictionary that maps category labels to representative sample texts. For each incoming document, iterate over the categories, compute the NCD, and pick the smallest.
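The steps just described could look roughly like this. It is a sketch under stated assumptions, not the original author's code: the reference texts and the names compressed_size, ncd, classify, and REFERENCES are illustrative, and the zlib branch exists only so the example also runs on interpreters without compression.zstd.

```python
try:
    from compression import zstd  # standard library from Python 3.14

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: stand-in for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, min(level, 9)))  # zlib stops at level 9


def ncd(x: bytes, y: bytes, level: int = 3) -> float:
    """Normalized Compression Distance between two byte strings."""
    c_x = compressed_size(x, level)
    c_y = compressed_size(y, level)
    c_xy = compressed_size(x + y, level)
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)


def classify(document: str, references: dict[str, str], level: int = 3) -> str:
    """Return the label whose reference text has the smallest NCD to the document."""
    doc = document.encode("utf-8")
    return min(
        references,
        key=lambda label: ncd(doc, references[label].encode("utf-8"), level),
    )


# Hypothetical reference texts, one per category.
REFERENCES = {
    "sports": (
        "The striker scored twice in the second half as the home team won the cup final. "
        "The goalkeeper saved a late penalty and the manager praised the defence after the match."
    ),
    "finance": (
        "Central banks raised interest rates again to slow persistent inflation. "
        "Bond yields climbed while equity markets fell and analysts warned about borrowing costs."
    ),
}

print(classify("The home team won the match after the goalkeeper saved a penalty.", REFERENCES))
```

Encoding to UTF-8 before compressing keeps the classifier language-agnostic, since the compressor only ever sees bytes.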

In benchmarks on the AG News dataset (four-class news classification), this approach with Zstandard at compression level 3 reaches roughly 62-65% accuracy, with no training step, no model download, and a classification speed of about 8,000 documents per second on a single CPU core. Raising the compression level to 10 pushes accuracy to about 68%, at the cost of cutting throughput to about 2,500 documents per second. These numbers do not match fine-tuned transformers, but they give a strong baseline for prototyping, data labeling triage, or environments where installing ML dependencies is impractical.

How Does NCD Compare to Traditional ML Classification?

The honest answer is that NCD is not a replacement for transformer-based classifiers in high-stakes production systems. Models such as BERT or GPT-based classifiers reach 94%+ accuracy on standard benchmarks. NCD with Zstandard occupies a different niche. It excels in cold-start scenarios where you have fewer than 50 labeled examples per class, a situation in which even fine-tuned models struggle. It requires zero training time, handles any language or encoding without modification, and runs entirely on CPU with constant memory.

For businesses that handle high volumes of incoming content (support tickets, social media mentions, product reviews), a Zstandard NCD classifier can serve as a first-pass router that sorts documents in real time before more expensive models refine the results. This two-stage pipeline cuts inference costs substantially while preserving overall accuracy. Platforms that process user-generated content at scale, such as Mewayz's 207-module business OS used by more than 138,000 users, benefit from lightweight classification to route messages, tag content, and personalize user experiences without heavy infrastructure.
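One way to sketch such a first-pass router is to accept the NCD label only when the closest category beats the runner-up by a clear margin, and to hand everything else to a heavier model. The margin rule, the min_margin default, and the expensive_model callback are all assumptions made for illustration; the article does not prescribe a particular escalation rule.

```python
from typing import Callable

try:
    from compression import zstd  # standard library from Python 3.14

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: stand-in for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, min(level, 9)))


def ncd(x: bytes, y: bytes, level: int = 3) -> float:
    c_x, c_y = compressed_size(x, level), compressed_size(y, level)
    return (compressed_size(x + y, level) - min(c_x, c_y)) / max(c_x, c_y)


def route(
    document: str,
    references: dict[str, str],
    expensive_model: Callable[[str], str],
    min_margin: float = 0.05,  # hypothetical threshold, tune on your own data
) -> tuple[str, str]:
    """Return (label, source); source is "ncd" or "model". Needs at least two categories."""
    doc = document.encode("utf-8")
    ranked = sorted(
        (ncd(doc, text.encode("utf-8")), label) for label, text in references.items()
    )
    (best, label), (runner_up, _) = ranked[0], ranked[1]
    if runner_up - best >= min_margin:
        return label, "ncd"                   # clear winner: cheap path
    return expensive_model(document), "model"  # ambiguous: escalate


# Hypothetical reference texts and a placeholder for the heavier classifier.
REFERENCES = {
    "billing": (
        "I was charged twice for my subscription this month and need a refund "
        "for the duplicate payment shown on my latest invoice and card statement."
    ),
    "login": (
        "I cannot log in to my account after the password reset, the verification "
        "email never arrives and the app keeps returning me to the sign-in screen."
    ),
}


def placeholder_model(text: str) -> str:
    return "needs_review"


print(route("My card was charged twice on the latest invoice.", REFERENCES, placeholder_model))
```

The margin acts as a crude stand-in for a confidence score, which NCD does not provide on its own.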

What Are the Limitations and Best Practices?

Compression-based classification has known limitations that you should account for. Short texts (under 100 bytes) produce unreliable NCD scores because the compressor does not have enough data to build meaningful patterns. The method is also sensitive to the choice of reference texts: poorly chosen representatives degrade accuracy significantly. And because NCD is a distance metric rather than a probabilistic model, it does not naturally produce confidence scores.

To get the most from this approach: use reference texts of at least 500 bytes per category, try several concatenated samples per class (2-3 representative documents joined together give a richer dictionary), normalize casing and whitespace before compression, and benchmark at Zstandard compression levels 3, 6, and 10 to find your speed-accuracy sweet spot. For short-text classification, pre-train a Zstandard dictionary on your domain corpus; this single step can improve accuracy by 8-12% on short documents.
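A small harness along these lines can cover the normalization, concatenated-reference, and level-sweep advice. Everything named here (normalize, build_reference, sweep_levels, the sample data) is a hypothetical illustration, the zlib fallback clamps levels to 9, and dictionary pre-training is deliberately left out of the sketch.

```python
import time

try:
    from compression import zstd  # standard library from Python 3.14

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: stand-in for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, min(level, 9)))


def ncd(x: bytes, y: bytes, level: int) -> float:
    c_x, c_y = compressed_size(x, level), compressed_size(y, level)
    return (compressed_size(x + y, level) - min(c_x, c_y)) / max(c_x, c_y)


def normalize(text: str) -> str:
    """Lowercase and collapse every run of whitespace into a single space."""
    return " ".join(text.lower().split())


def build_reference(samples: list[str]) -> bytes:
    """Join a few representative documents into one reference text."""
    return " ".join(normalize(s) for s in samples).encode("utf-8")


def classify(document: str, references: dict[str, bytes], level: int) -> str:
    doc = normalize(document).encode("utf-8")
    return min(references, key=lambda label: ncd(doc, references[label], level))


def sweep_levels(labelled_docs, references, levels=(3, 6, 10)):
    """Map each level to (accuracy, elapsed seconds) on a labelled validation set."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        hits = sum(classify(text, references, level) == label for text, label in labelled_docs)
        results[level] = (hits / len(labelled_docs), time.perf_counter() - start)
    return results


# Hypothetical data: two samples per category and a tiny validation set.
REFERENCES = {
    "sports": build_reference([
        "The striker scored twice as the home team won the cup final in extra time.",
        "The goalkeeper saved a late penalty and the manager praised the defence.",
    ]),
    "finance": build_reference([
        "Central banks raised interest rates again to slow persistent inflation.",
        "Bond yields climbed while equity markets fell on fears of higher borrowing costs.",
    ]),
}
VALIDATION = [
    ("The home team won after the goalkeeper saved a penalty.", "sports"),
    ("Equity markets fell as interest rates and bond yields climbed.", "finance"),
]

for level, (accuracy, seconds) in sweep_levels(VALIDATION, REFERENCES).items():
    print(f"level {level}: accuracy={accuracy:.2f}, time={seconds:.4f}s")
```

On real data the validation set should be far larger than this toy example, otherwise the accuracy column says little about which level to pick.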

Frequently Asked Questions

Does compression-based classification work for sentiment analysis?

It can, but with caveats. Sentiment analysis requires detecting subtle differences in tone between texts that are otherwise structurally similar. NCD works best for topic classification, where documents in different categories use different vocabularies. For sentiment, accuracy typically falls to around 55-60%: better than random, but not production-ready on its own. Combining NCD features with a lightweight logistic regression model improves results significantly.

Can I use the compression.zstd module in Python versions before 3.14?

No. The compression.zstd module is new in Python 3.14. For earlier versions, install the python-zstandard package (published on PyPI as zstandard), which provides similar compress() and decompress() functions. The NCD logic stays the same; only the import statement changes. Once you upgrade to 3.14, you can drop the third-party dependency entirely.

How does Zstandard NCD perform compared to TF-IDF with cosine similarity?

On balanced multi-class topic classification datasets, TF-IDF with cosine similarity typically reaches 75-82% accuracy, compared with 62-68% for Zstandard NCD. However, TF-IDF requires a fitted vectorizer, a defined vocabulary, and language-specific tokenization. Zstandard NCD needs none of this preprocessing, works across languages out of the box, and classifies new documents in constant time regardless of vocabulary size. For rapid prototyping or multilingual environments, NCD is often the fastest path to a working system.

Whether you are building automated content pipelines, routing customer messages, or prototyping classification logic for your digital business, Python 3.14's built-in Zstandard support makes compression-based NCD more accessible than ever. If you are looking for an all-in-one platform to manage your business operations, content, courses, and customer relationships, get started with Mewayz today and put these techniques to work across your entire operation.
