Hacker News

Text Classification with Python 3.14's ZSTD Module

9 min read Via maxhalford.github.io

Mewayz Team

Editorial Team

Python 3.14 adds the compression.zstd module to the standard library, and with it a surprisingly powerful way to classify text without machine learning models. By measuring how well a compressor can squeeze two texts together, you can estimate how similar they are. The technique is called Normalized Compression Distance (NCD), and Zstandard now makes it fast enough for production workloads.

How does compression-based text classification actually work?

The core idea behind compression-based classification is rooted in information theory. When a compression algorithm such as Zstandard processes a block of text, it builds an internal dictionary of the patterns it has seen. If two texts share vocabulary, syntax, and structure, compressing them together produces output only slightly larger than compressing the larger text on its own. If they are unrelated, the compressed size of the concatenation approaches the sum of the two individual compressed sizes.

This relationship is captured by the Normalized Compression Distance formula: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(x) is the compressed size of text x and C(xy) is the compressed size of the two texts concatenated. An NCD value near 0 indicates that the texts are very similar, while a value near 1 indicates that they share almost no information.
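The formula translates almost directly into code. The following is a minimal sketch rather than a reference implementation: the names compressed_size and ncd and the two sample sentences are illustrative, and the zlib fallback is an assumption added only so the snippet also runs on interpreters older than 3.14.

```python
try:
    from compression import zstd  # standard library from Python 3.14

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: stand-in compressor for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, level))


def ncd(x: bytes, y: bytes, level: int = 3) -> float:
    """Normalized Compression Distance between two byte strings."""
    c_x = compressed_size(x, level)
    c_y = compressed_size(y, level)
    c_xy = compressed_size(x + y, level)  # C(xy): size of the concatenation
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)


# Illustrative inputs, not taken from any benchmark.
sports = b"The striker scored twice in the second half and the team won the league match."
finance = b"The central bank raised interest rates as inflation and bond yields climbed."

print(ncd(sports, sports))   # a text against itself: expected to sit near 0
print(ncd(sports, finance))  # unrelated texts: expected to sit much nearer 1
```

Because real compressors add headers and are not perfectly symmetric, the score is only approximately bounded by 0 and 1; values slightly above 1 can occur on short inputs.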

What makes this technique remarkable is that it needs no model training, no tokenization, no embeddings, and no GPU. The compressor itself acts as a learned model of text structure. Research published in papers such as "Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors" (2023) showed that gzip-based NCD rivalled BERT on some benchmarks, which sparked renewed interest in the approach.

Why is Python 3.14's Zstandard module a game-changer for NCD?

Before Python 3.14, using Zstandard meant installing the third-party python-zstandard package. The new compression.zstd module, introduced through PEP 784, ships directly with CPython. That means zero dependency overhead and a guaranteed, stable API backed by Meta's battle-tested libzstd. For classification tasks specifically, Zstandard offers several advantages over gzip or bzip2:

  • Speed: Zstandard compresses 3-5x faster than gzip at comparable ratios, which makes batch classification over thousands of documents viable in seconds rather than minutes
  • Tunable compression level: levels 1 through 22 let you trade speed against ratio, so you can calibrate NCD accuracy against throughput requirements
  • Dictionary support: pre-trained Zstandard dictionaries can substantially improve compression of small texts (under 4KB), which is exactly the document-size range where NCD accuracy matters most
  • Streaming API: the module supports incremental compression, which enables classification pipelines that process texts without loading entire corpora into memory
  • Standard-library stability: no version conflicts and no supply-chain risk, since `from compression import zstd` works on every Python 3.14+ installation
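To see the speed-versus-ratio trade-off on your own data, a small profiling helper is enough. This is a hedged sketch: profile_levels and the sample corpus are hypothetical names, the levels 1, 3, and 9 were picked only because the zlib fallback (an assumption for pre-3.14 interpreters) accepts them too, and timings will vary by machine.

```python
import time

try:
    from compression import zstd  # Python 3.14+

    def compress(data: bytes, level: int) -> bytes:
        return zstd.compress(data, level=level)
except ImportError:
    import zlib  # assumption: fallback for older interpreters; its levels stop at 9

    def compress(data: bytes, level: int) -> bytes:
        return zlib.compress(data, level)


def profile_levels(data: bytes, levels=(1, 3, 9)) -> dict[int, tuple[int, float]]:
    """Map each level to (compressed size in bytes, seconds spent compressing)."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        size = len(compress(data, level))
        results[level] = (size, time.perf_counter() - start)
    return results


# Illustrative, highly repetitive input; real corpora compress far less.
corpus = b"Compression level trades speed against ratio. " * 500
for level, (size, seconds) in profile_levels(corpus).items():
    print(f"level {level}: {size} bytes in {seconds * 1000:.3f} ms")
```

Running the same helper over a sample of real documents gives a quick feel for which level is affordable at a given throughput target.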

Key insight: Compression-based classification works best when you need a fast, dependency-free baseline that handles multilingual text natively. Because compressors operate on raw bytes rather than language-specific tokens, they classify Chinese, Arabic, or mixed-language documents just as they do English, with no language model required.

What does a practical implementation look like?

A minimal NCD classifier in Python 3.14 fits in under 30 lines. You encode one reference text per category, then for each new document you compute the NCD against every reference and assign the category with the smallest distance. Here is the core logic:

First, import the module with `from compression import zstd`. Define a function that takes two byte strings, compresses each one, compresses their concatenation, and returns the NCD score. Then build a dictionary that maps category labels to representative sample texts. For each incoming document, loop over the categories, compute the NCD, and pick the minimum.
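The steps above can be sketched as follows. This is one possible reading of the description, not the original author's code: the REFERENCES texts, the two category labels, and the classify helper are hypothetical, and the zlib branch is an assumption that keeps the snippet runnable before 3.14.

```python
try:
    from compression import zstd  # Python 3.14+

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: stand-in compressor for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, level))


def ncd(x: bytes, y: bytes, level: int = 3) -> float:
    c_x, c_y = compressed_size(x, level), compressed_size(y, level)
    return (compressed_size(x + y, level) - min(c_x, c_y)) / max(c_x, c_y)


# Hypothetical reference texts: one representative sample per category.
REFERENCES = {
    "sports": (
        b"The home team won the league match after the striker scored twice "
        b"in the second half. The coach praised the goalkeeper and the "
        b"defence after the final whistle."
    ),
    "finance": (
        b"The central bank raised interest rates as inflation climbed. "
        b"Investors sold shares and bond yields rose while the stock market "
        b"closed lower on Friday."
    ),
}


def classify(text: str, references=REFERENCES, level: int = 3) -> str:
    """Return the label whose reference text is nearest to `text` by NCD."""
    doc = text.encode("utf-8")  # compressors work on bytes, not str
    return min(references, key=lambda label: ncd(references[label], doc, level))


print(classify("The striker scored twice and the team won the league match."))
```

With a single reference per category the result depends heavily on how representative that text is, so longer or concatenated references are worth trying in practice.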

In benchmarks against the AG News dataset (four-class news classification), this approach with Zstandard at compression level 3 reaches roughly 62-65% accuracy, with no training step, no model download, and a classification speed of about 8,000 documents per second on a single CPU core. Raising the compression level to 10 pushes accuracy to roughly 68%, at the cost of cutting throughput to about 2,500 documents per second. These numbers do not match fine-tuned transformers, but they provide a strong baseline for prototyping, data-labelling triage, or environments where installing ML dependencies is impractical.
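Accuracy and throughput figures like these are easy to re-measure on your own data. The harness below is a sketch under stated assumptions: evaluate, REFERENCES, and SAMPLE are hypothetical, the two-document sample stands in for a real labelled set such as AG News, and the zlib fallback clamps the level to 9 because zlib does not accept higher values. It does not reproduce the numbers quoted above.

```python
import time

try:
    from compression import zstd  # Python 3.14+

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: fallback for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, min(level, 9)))  # zlib levels stop at 9


def evaluate(labelled_docs, references, level: int = 3):
    """Return (accuracy, documents per second) for nearest-reference NCD."""
    # Reference sizes never change, so compress each reference only once.
    ref_sizes = {label: compressed_size(ref, level) for label, ref in references.items()}
    correct = 0
    start = time.perf_counter()
    for doc, expected in labelled_docs:
        c_doc = compressed_size(doc, level)

        def distance(label):
            c_ref = ref_sizes[label]
            c_both = compressed_size(references[label] + doc, level)
            return (c_both - min(c_ref, c_doc)) / max(c_ref, c_doc)

        correct += min(references, key=distance) == expected
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return correct / len(labelled_docs), len(labelled_docs) / elapsed


# Hypothetical toy data; replace with a real labelled dataset.
REFERENCES = {
    "sports": b"The home team won the league match after the striker scored twice "
              b"in the second half. The coach praised the goalkeeper and the defence.",
    "finance": b"The central bank raised interest rates as inflation climbed. "
               b"Investors sold shares and bond yields rose while the market closed lower.",
}
SAMPLE = [
    (b"The striker scored twice and the team won the league match.", "sports"),
    (b"The central bank raised interest rates and bond yields rose.", "finance"),
]

for level in (3, 10):
    accuracy, rate = evaluate(SAMPLE, REFERENCES, level)
    print(f"level {level}: accuracy {accuracy:.2f}, {rate:.0f} docs/s")
```

Caching the compressed size of each reference removes one compression call per comparison, which matters once the number of categories grows.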

How does NCD compare with traditional ML classification?

The honest answer is that NCD is not a replacement for transformer-based classifiers in high-stakes production systems. Models such as BERT-based or GPT-based classifiers reach 94%+ accuracy on standard benchmarks. NCD with Zstandard nevertheless occupies a niche of its own. It excels in cold-start scenarios where you have fewer than 50 labelled examples per class, a situation in which even fine-tuned models struggle. It requires zero training time, handles any language or encoding without modification, and runs entirely on the CPU with constant memory.

A two-stage pipeline that puts NCD in front of a heavier model cuts inference costs substantially while preserving overall accuracy. Platforms that process user-generated content at scale, such as Mewayz's business OS with its 207 modules used by more than 138,000 entrepreneurs, benefit from lightweight classification to route messages, tag content, and personalize user experiences without heavy infrastructure.

What are the limitations and best practices?

Compression-based classification has known limitations that you should account for. Short texts (under 100 bytes) produce unreliable NCD scores because the compressor has too little data to build meaningful patterns. The technique is also sensitive to the choice of reference texts: poorly chosen representatives degrade accuracy sharply. And because NCD is a distance metric rather than a probabilistic model, it does not naturally produce confidence scores.
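One workaround for the missing confidence score is to look at the gap between the nearest and second-nearest reference and abstain when it is small. This heuristic is an assumption of this sketch, not something the article's benchmarks cover: classify_with_margin, the min_margin default of 0.05, and the reference texts are all hypothetical, and the margin is not a calibrated probability.

```python
try:
    from compression import zstd  # Python 3.14+

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: stand-in compressor for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, level))


def ncd(x: bytes, y: bytes, level: int = 3) -> float:
    c_x, c_y = compressed_size(x, level), compressed_size(y, level)
    return (compressed_size(x + y, level) - min(c_x, c_y)) / max(c_x, c_y)


def classify_with_margin(doc: bytes, references, min_margin: float = 0.05, level: int = 3):
    """Return (label or None, margin); needs at least two reference categories.

    The margin is the distance gap between the best and second-best
    category. A label of None means "too close to call".
    """
    ranked = sorted((ncd(ref, doc, level), label) for label, ref in references.items())
    (best, label), (runner_up, _) = ranked[0], ranked[1]
    margin = runner_up - best
    return (label if margin >= min_margin else None), margin


# Hypothetical reference texts.
REFERENCES = {
    "sports": b"The home team won the league match after the striker scored twice "
              b"in the second half. The coach praised the goalkeeper and the defence.",
    "finance": b"The central bank raised interest rates as inflation climbed. "
               b"Investors sold shares and bond yields rose while the market closed lower.",
}

label, margin = classify_with_margin(
    b"The striker scored twice and the team won the league match.", REFERENCES
)
print(label, round(margin, 3))
```

Documents that come back as None are natural candidates for a second, more expensive classifier or for manual review; the threshold itself has to be tuned on held-out data.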

For classifying small texts, pre-train a Zstandard dictionary on your domain corpus, a step that can improve accuracy by 8-12 percentage points on short documents.
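A hedged sketch of how a trained dictionary could plug into the NCD computation is shown below. It relies on `train_dict` and the `zstd_dict` argument of `compress` from compression.zstd; the helper names, the 16 KB default dictionary size, and the choice to apply the same dictionary to all three compressions are assumptions of this example. There is no fallback here, because dictionaries are specific to Zstandard.

```python
try:
    from compression import zstd  # Python 3.14+
except ImportError:
    zstd = None  # this sketch has no fallback: it needs compression.zstd


def _require_zstd() -> None:
    if zstd is None:
        raise RuntimeError("compression.zstd requires Python 3.14 or newer")


def train_domain_dict(samples, dict_size: int = 16 * 1024):
    """Train a Zstandard dictionary from an iterable of bytes samples.

    Training generally needs a large number of samples from the target
    domain; libzstd may raise an error on very small sample sets.
    """
    _require_zstd()
    return zstd.train_dict(samples, dict_size)


def ncd_with_dict(x: bytes, y: bytes, zstd_dict, level: int = 3) -> float:
    """NCD in which every compression call is primed with the same dictionary."""
    _require_zstd()

    def size(data: bytes) -> int:
        return len(zstd.compress(data, level=level, zstd_dict=zstd_dict))

    c_x, c_y = size(x), size(y)
    return (size(x + y) - min(c_x, c_y)) / max(c_x, c_y)
```

A typical flow would call train_domain_dict once on a few thousand domain documents, keep the resulting dictionary in memory, and pass it to ncd_with_dict for every comparison.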

Frequently asked questions

Does compression-based classification work for sentiment analysis?

It can, but with caveats. Sentiment analysis requires detecting subtle tonal differences between structurally similar texts. NCD works better for topic classification, where documents in different categories use distinct vocabulary. For sentiment, accuracy typically lands around 55-60%: better than random, but not production-ready on its own. Combining NCD features with a lightweight logistic regression model improves results considerably.
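Turning NCD into features usually means computing the distance from a document to each reference text and using that vector as model input. The sketch below covers only the feature step; ncd_features and the reference texts are hypothetical, the zlib branch is an assumption for pre-3.14 interpreters, and the downstream logistic regression is left to whichever library you already use.

```python
try:
    from compression import zstd  # Python 3.14+

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zstd.compress(data, level=level))
except ImportError:
    import zlib  # assumption: stand-in compressor for older interpreters

    def compressed_size(data: bytes, level: int = 3) -> int:
        return len(zlib.compress(data, level))


def ncd(x: bytes, y: bytes, level: int = 3) -> float:
    c_x, c_y = compressed_size(x, level), compressed_size(y, level)
    return (compressed_size(x + y, level) - min(c_x, c_y)) / max(c_x, c_y)


def ncd_features(doc: bytes, references, level: int = 3) -> list[float]:
    """One NCD value per reference, ordered by sorted label for stability."""
    return [ncd(references[label], doc, level) for label in sorted(references)]


# Hypothetical reference texts; a sentiment task would use labelled reviews.
REFERENCES = {
    "sports": b"The home team won the league match after the striker scored twice "
              b"in the second half. The coach praised the goalkeeper and the defence.",
    "finance": b"The central bank raised interest rates as inflation climbed. "
               b"Investors sold shares and bond yields rose while the market closed lower.",
}

features = ncd_features(
    b"The striker scored twice and the team won the league match.", REFERENCES
)
print(dict(zip(sorted(REFERENCES), features)))
```

Using several references per class produces a longer, more informative vector, at the cost of one extra compression call per reference.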

Can I use the compression.zstd module on Python versions before 3.14?

No. The compression.zstd module is new in Python 3.14. For older versions, install the python-zstandard package from PyPI (published there under the name zstandard), which provides comparable compress() and decompress() functions. The NCD logic stays the same; only the import statement changes. Once you upgrade to 3.14, you can drop the third-party dependency entirely.

How does Zstandard NCD compare with TF-IDF and cosine similarity?

For multi-class topic classification on balanced datasets, TF-IDF plus cosine similarity typically reaches 75-82% accuracy, compared with 62-68% for Zstandard NCD. However, TF-IDF requires a fitted vectorizer, a defined vocabulary, and language-specific stop-word lists. Zstandard NCD needs none of that preprocessing, works on any language out of the box, and classifies new documents in time that does not depend on vocabulary size. For rapid prototyping or multilingual settings, NCD is often the faster route to a working system.

Whether you are building automated content pipelines, routing customer messages, or prototyping classification logic for your digital business, Python 3.14's built-in Zstandard support makes compression-based NCD more accessible than ever. If you are looking for an all-in-one platform to manage your business's content, products, courses, and customer interactions, start building on Mewayz today and put these techniques to work across your whole operation.
