
Training Data

Two datasets. Two Snake models. Every prediction traces back to here.

How data flows through the product

Text in → Extraction (Haiku or regex) → doc_type model → price_anomaly model (per line) → Decision + trust score

The extraction step turns free text into features. The two Snake models (doc_type and price_anomaly) produce every classification in the response. Everything else (arithmetic checks, totals verification, decision logic) is deterministic code. The models are the only learned components.

Interaction between the two models: doc_type gates the flow. If the document is classified as a BL or Confirmation, price anomaly detection is skipped (there are no prices to check). If it is a Facture or Avoir, every line goes through price_anomaly. The doc_type prediction also feeds the trust score: a Facture classified with more than 95% confidence adds 20 points, while an ambiguous Facture/Avoir classification lowers the score.
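The gating described above can be sketched in a few lines. This is an illustration only: the names `PRICED_TYPES`, `needs_price_check`, and `route` are hypothetical and do not come from the product code.

```python
# Document types that carry prices, per the description above.
# All names in this sketch are hypothetical, not the product's API.
PRICED_TYPES = {"Facture", "Avoir"}


def needs_price_check(doc_type):
    """Return True when each line should go through price_anomaly."""
    return doc_type in PRICED_TYPES


def route(doc_type, lines):
    """Return the lines to send to price_anomaly (empty when the check is skipped)."""
    return list(lines) if needs_price_check(doc_type) else []
```

With this reading, a BL or Confirmation never reaches the second model, so its response would carry no per-line anomaly predictions.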

Dataset 1: doc_type

Accuracy: 97.1% (140 test samples)

Training set: 700 rows × 8 features × 4 classes

Classifies the incoming document as Facture, Confirmation, Avoir, or BL.

Features

Feature	Type	What it captures
`has_total_ttc`	oui/non	TTC total present — Facture and Avoir have this, Confirmation and BL don't
`has_tva`	oui/non	TVA mentioned — same split as TTC
`has_conditions_paiement`	oui/non	Payment terms (30j, comptant) — Facture's signature feature, rarely on Avoir
`has_ref_commande`	oui/non	PO reference — present across types, not discriminating alone
`has_signature`	oui/non	Signed — common on BL, not discriminating alone
`has_date_livraison`	oui/non	Delivery date or shipping info — BL's clean separator, 100% accuracy
`nb_lignes`	int	Line count — Avoir: 1-5, Facture: 3-15, BL: 1-20
`has_rib`	oui/non	Bank details — Facture only

Sample rows

{"has_total_ttc":"oui","has_tva":"oui","has_conditions_paiement":"oui","has_ref_commande":"oui","has_signature":"non","has_date_livraison":"non","nb_lignes":7,"has_rib":"oui","label":"Facture"}

{"has_total_ttc":"non","has_tva":"non","has_conditions_paiement":"non","has_ref_commande":"oui","has_signature":"oui","has_date_livraison":"oui","nb_lignes":12,"has_rib":"non","label":"BL"}

{"has_total_ttc":"oui","has_tva":"oui","has_conditions_paiement":"non","has_ref_commande":"oui","has_signature":"non","has_date_livraison":"non","nb_lignes":2,"has_rib":"non","label":"Avoir"}

What Snake learned (SAT clauses as business rules)

IF has_date_livraison = oui                                    → BL
IF has_tva = non AND has_date_livraison = non                  → Confirmation
IF has_tva = oui AND has_conditions_paiement = oui             → Facture
IF has_tva = oui AND has_conditions_paiement = non AND nb_lignes ≤ 5  → Avoir

These aren't programmed rules — they're the SAT clauses Snake constructed from data. They read like business rules because the features map to domain concepts.
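To make the four clauses concrete, here is a minimal sketch that applies them in the order listed and checks them against the three sample rows. It is a hand-written transcription for illustration, not Snake's actual inference code; the function name `classify_doc_type` is hypothetical, and returning `None` for uncovered cases is an assumption.

```python
import json
from typing import Optional


def classify_doc_type(row: dict) -> Optional[str]:
    """Apply the four listed clauses in order; None if none of them matches."""
    if row["has_date_livraison"] == "oui":
        return "BL"
    if row["has_tva"] == "non":
        # has_date_livraison is already known to be "non" at this point
        return "Confirmation"
    if row["has_conditions_paiement"] == "oui":
        return "Facture"
    if row["nb_lignes"] <= 5:
        return "Avoir"
    # has_tva=oui, no payment terms, more than 5 lines: not covered by the listed clauses
    return None


# The three sample rows from this page, reduced to the features the clauses use.
samples = [
    '{"has_tva":"oui","has_conditions_paiement":"oui","has_date_livraison":"non","nb_lignes":7,"label":"Facture"}',
    '{"has_tva":"non","has_conditions_paiement":"non","has_date_livraison":"oui","nb_lignes":12,"label":"BL"}',
    '{"has_tva":"oui","has_conditions_paiement":"non","has_date_livraison":"non","nb_lignes":2,"label":"Avoir"}',
]
rows = [json.loads(s) for s in samples]
predictions = [classify_doc_type(r) for r in rows]
```

The four clauses use only four of the eight features, and they leave some combinations without a verdict (for example TVA present, no payment terms, and more than five lines).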

Impact on the product

The doc_type prediction appears in the response at document.type.Prediction and directly affects:

Trust score: +20 points if doc_type confidence > 95%
Pipeline routing: only Factures and Avoirs trigger price anomaly checks
XAI audit: the full Snake audit trail is returned in xai.doc_type_audit

Dataset 2: price_anomaly

Accuracy: 98.6% (140 test samples)

Training set: 700 rows × 5 features × 3 classes

Classifies each invoice line as Normal, Alerte, or Critique.

Features

Feature	Type	Range	What it captures
`ecart_pct`	float	-60 to +60	Invoice price vs PO/reference price (%)
`ecart_historique_pct`	float	-40 to +40	Invoice price vs average of recent invoices (%)
`nb_lignes_ecart`	int	0-5	How many other lines in the same invoice also deviate
`montant_ecart`	float	0-30K	Total monetary impact: |deviation| × quantity
`fournisseur_fiabilite`	float	0.40-1.00	Supplier reliability score from ERP
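As a rough sketch of how the first and fourth features could be derived from an invoice line, assuming the deviation is measured relative to the reference price and that "deviation" in `montant_ecart` means the per-unit price difference. Both are assumptions; the page does not give the exact formulas. The function names simply mirror the feature names.

```python
def ecart_pct(invoice_price: float, reference_price: float) -> float:
    """Signed deviation of the invoice price from the reference price, in percent."""
    return (invoice_price - reference_price) / reference_price * 100.0


def montant_ecart(invoice_price: float, reference_price: float, quantity: float) -> float:
    """Total monetary impact: absolute per-unit deviation times quantity (assumed reading)."""
    return abs(invoice_price - reference_price) * quantity


# Figures from the GlassCorp example further down this page.
line_1 = round(ecart_pct(21.50, 21.0), 1)   # 200 units at 21.50, reference 21
line_2 = round(ecart_pct(58.0, 54.0), 1)    # 150 units at 58, reference 54
```

Under these assumptions the two GlassCorp lines come out at +2.4% and +7.4%, matching the walkthrough later on the page.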

Sample rows

{"ecart_pct":2.3,"ecart_historique_pct":1.8,"nb_lignes_ecart":0,"montant_ecart":46.0,"fournisseur_fiabilite":0.91,"label":"Normal"}

{"ecart_pct":8.7,"ecart_historique_pct":6.2,"nb_lignes_ecart":1,"montant_ecart":870.0,"fournisseur_fiabilite":0.72,"label":"Alerte"}

{"ecart_pct":-22.5,"ecart_historique_pct":-18.0,"nb_lignes_ecart":3,"montant_ecart":4500.0,"fournisseur_fiabilite":0.52,"label":"Critique"}

The learned decision boundaries

Condition	Prediction	Business meaning
|ecart| < 5% AND fiabilite > 0.70	Normal	Small deviation, trusted supplier: pay
|ecart| 5-15%	Alerte	Medium deviation: human review
|ecart| > 15%	Critique	Large deviation: block payment
|ecart| < 5% BUT fiabilite < 0.60 AND nb_lignes_ecart ≥ 2	Alerte	Small deviation, but a pattern of deviations from an unreliable supplier
|ecart| 3-6% BUT fiabilite > 0.92 AND isolated	Normal	Slightly high, but a top supplier with no pattern: trust it

The fiabilite interaction is the key finding: Snake learned that who sent the invoice matters, not just the numbers. A small deviation from an unreliable supplier with multiple deviating lines is more suspicious than a slightly larger, isolated deviation from a top supplier. This was never explicitly programmed.
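The table can be read as a small decision function. The sketch below is one possible transcription, not the learned model: it assumes `ecart` means `ecart_pct`, that the deviating-line count is `nb_lignes_ecart`, that "isolated" means `nb_lignes_ecart == 0`, and that the two exception rows take precedence over the general ones. Boundary handling at exactly 5% and 15% is also an assumption, and combinations the table does not cover return `None` rather than a guess.

```python
from typing import Optional


def classify_line(ecart_pct: float, fournisseur_fiabilite: float,
                  nb_lignes_ecart: int) -> Optional[str]:
    """Hand transcription of the boundary table; None where the table is silent."""
    a = abs(ecart_pct)
    # Exception row: top supplier, isolated deviation of 3-6%
    if 3.0 <= a <= 6.0 and fournisseur_fiabilite > 0.92 and nb_lignes_ecart == 0:
        return "Normal"
    if a > 15.0:
        return "Critique"
    if a >= 5.0:
        return "Alerte"
    # From here on the deviation is below 5%
    if fournisseur_fiabilite < 0.60 and nb_lignes_ecart >= 2:
        return "Alerte"
    if fournisseur_fiabilite > 0.70:
        return "Normal"
    return None  # e.g. small deviation with fiabilite between 0.60 and 0.70


# The three sample rows from this page: (ecart_pct, fiabilite, nb_lignes_ecart)
sample_predictions = [
    classify_line(2.3, 0.91, 0),
    classify_line(8.7, 0.72, 1),
    classify_line(-22.5, 0.52, 3),
]
```

The transcription reproduces the labels of the three sample rows, but the `None` branch shows that the five table rows alone do not partition the feature space; the trained model presumably fills those gaps.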

Impact on the product

The price_anomaly prediction appears per line in controle_lignes[].controle_prix.anomalie and drives:

The red/green table: each line gets a color — the buyer's primary visual
Alert generation: non-Normal predictions create entries in synthese.alertes
Decision: 0 alerts → Valider, 1 alert → Verifier, 2+ alerts or any Critique → Bloquer
Trust score: price_anomaly confidence contributes up to 25 points
XAI audit: per-line reasoning in xai.prix_audit
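The decision rule in the list above is deterministic, so it can be sketched directly. This is one reading of it, assuming that a single Critique line blocks the document even when it is the only alert; the function name `decide` is hypothetical.

```python
def decide(predictions):
    """Map per-line price_anomaly predictions to a document-level decision."""
    alerts = [p for p in predictions if p != "Normal"]
    if "Critique" in alerts or len(alerts) >= 2:
        return "Bloquer"
    if len(alerts) == 1:
        return "Verifier"
    return "Valider"


# GlassCorp example from this page: one Normal line and one Alerte line.
glasscorp_decision = decide(["Normal", "Alerte"])
```

For the GlassCorp example, one Alerte and no Critique gives Verifier.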

How the datasets interact

Text: "Facture GlassCorp, 200 trempes 10mm a 21.50 EUR, 150 feuilletes a 58 EUR. Normalement 21 et 54."

 1. Extraction     regex finds: 2 products, 2 prices, 2 ref prices, supplier
                   features: has_tva=non, has_total_ttc=non, has_rib=non, ...

 2. doc_type       Snake predicts: Facture? Confirmation? (depends on text keywords)
                   feeds: trust_score.doc_type (+0-20 pts)
                   gates: only Facture/Avoir proceed to price check

 3. price_anomaly  For each line, Snake classifies the deviation:
                   L1: ecart=+2.4%, fiabilite=0.85 → Normal
                   L2: ecart=+7.4%, fiabilite=0.85 → Alerte
                   feeds: trust_score.price_anomaly (+0-25 pts)
                   feeds: synthese.alertes, synthese.decision

 4. Trust score    extraction(17) + doc_type(20) + price(23) + completeness(15) + consistency(15)
                   = 90/100

The two models run sequentially but are independent: they share neither weights nor features. doc_type sees document-level features, mostly booleans (has_tva, has_rib). price_anomaly sees line-level numeric features (ecart_pct, fournisseur_fiabilite). The trust score is the only place where their outputs combine: a weak doc_type classification together with weak price confidence yields low trust, even if each prediction on its own might be acceptable.
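The trust score in step 4 is a plain sum of five components. The sketch below reproduces that arithmetic and enforces only the two caps this page states (20 for doc_type, 25 for price_anomaly). The caps of the other components are not given here, so they are left unchecked, and the names `KNOWN_CAPS` and `trust_score` are hypothetical.

```python
# Caps stated on this page; the other components' caps are not documented here.
KNOWN_CAPS = {"doc_type": 20, "price_anomaly": 25}


def trust_score(components):
    """Sum the component scores after checking the two documented caps."""
    for name, cap in KNOWN_CAPS.items():
        value = components.get(name, 0)
        if not 0 <= value <= cap:
            raise ValueError(f"{name} must be between 0 and {cap}, got {value}")
    return sum(components.values())


# Component values from the worked example above.
example = {
    "extraction": 17,
    "doc_type": 20,
    "price_anomaly": 23,
    "completeness": 15,
    "consistency": 15,
}
score = trust_score(example)
```

Because the two model contributions are added rather than multiplied, a weak doc_type score and a weak price score each subtract independently from the total.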

Real-life data potential

doc_type

Source: ERP document management

Volume: ~2,600 docs/year/factory

5 factories: 13,000 labeled samples/year

Labeling: free — the ERP already tags document type on entry

Real data adds: OCR noise, multilingual docs, edge cases (facture proforma, combined documents). Accuracy is expected to dip to ~94% at first, then recover to 97%+ as volume grows.

price_anomaly

Source: ERP invoice history + PO matching

Volume: ~150 lines/week/factory

5 factories: 39,000 labeled samples/year

Labeling: free — paid = Normal, reviewed = Alerte, disputed = Critique

Real data adds: product-specific thresholds, seasonal pricing, supplier patterns. The 5% boundary becomes product-aware.
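The labeling rule and the volume figure for price_anomaly can be sketched as follows. The status strings `paid`, `reviewed`, and `disputed` are taken from the wording above, but how an ERP actually encodes them is an assumption, as are the names `STATUS_TO_LABEL`, `label_from_status`, and `yearly_samples`. The yearly volume assumes 52 working weeks.

```python
# Hypothetical ERP status values mapped to training labels, per the rule above.
STATUS_TO_LABEL = {"paid": "Normal", "reviewed": "Alerte", "disputed": "Critique"}


def label_from_status(status):
    """Derive a price_anomaly label from an invoice line's ERP outcome."""
    return STATUS_TO_LABEL[status.strip().lower()]


def yearly_samples(per_week_per_factory, factories, weeks=52):
    """Labeled samples per year across all factories (52 weeks assumed)."""
    return per_week_per_factory * weeks * factories


price_volume = yearly_samples(150, 5)   # 150 lines/week/factory, 5 factories
doc_volume = 2600 * 5                   # 2,600 docs/year/factory, 5 factories
```

An unknown status raises `KeyError` in this sketch, which is a deliberate choice: silently defaulting to Normal would pollute the training labels.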

Current datasets are synthetic. The architecture, features, and classification logic are production-ready.

Plug in real ERP data → retrain → the same API serves real predictions.

No code changes are needed: run python3 train_all.py with the new NDJSON files.
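For reference, NDJSON is one JSON object per line. The sketch below writes and reads such a file using only the standard library. The file name `price_anomaly.ndjson` and the helper names are hypothetical; the file names and locations that `train_all.py` expects are not documented on this page.

```python
import json
import os
import tempfile


def write_ndjson(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_ndjson(path):
    """Read an NDJSON file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round trip with one row in the price_anomaly schema shown on this page.
rows = [{"ecart_pct": 2.3, "ecart_historique_pct": 1.8, "nb_lignes_ecart": 0,
         "montant_ecart": 46.0, "fournisseur_fiabilite": 0.91, "label": "Normal"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "price_anomaly.ndjson")  # hypothetical file name
    write_ndjson(path, rows)
    loaded = read_ndjson(path)
```

Keeping the real ERP export in the same column names as the synthetic rows is what would let retraining proceed without code changes.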


Charles Dana — Monce SAS — 2026