Training Data

Two datasets. Two Snake models. Every prediction traces back to here.

How data flows through the product

Text in → Extraction (Haiku or regex) → doc_type model → price_anomaly model (per line) → Decision + trust score

The extraction step turns free text into features. The two Snake models, doc_type and price_anomaly, produce every classification in the response. Everything else (arithmetic checks, totals verification, decision logic) is deterministic code. The models are the only learned components.

Interaction between the two models: doc_type gates the flow — if the document is classified as a BL or Confirmation, price anomaly detection is skipped (no prices to check). If it's a Facture or Avoir, every line goes through price_anomaly. The doc_type prediction also feeds the trust score: a high-confidence Facture classification adds 20 points, while a confused Facture/Avoir drops it.
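
The gating described above can be sketched in a few lines. This is a minimal illustration, not the product's code: the names PRICED_TYPES and lines_to_check are hypothetical, and only the rule itself (Facture and Avoir go through the price check, BL and Confirmation skip it) comes from the text.

```python
# Document types that carry prices, per the description above.
# PRICED_TYPES and lines_to_check are hypothetical names for illustration.
PRICED_TYPES = {"Facture", "Avoir"}


def lines_to_check(doc_type, lines):
    """Return the lines that should go through price_anomaly.

    BL and Confirmation carry no prices, so nothing is returned for them.
    """
    if doc_type in PRICED_TYPES:
        return list(lines)
    return []
```

Called with "Facture" and two lines, the function returns both lines; called with "BL", it returns an empty list and the price model is never invoked.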

Dataset 1: doc_type

Accuracy: 97.1% (140 test samples)
Training data: 700 rows × 8 features × 4 classes

Classifies the incoming document as Facture, Confirmation, Avoir, or BL.

Features

Feature                  Type     What it captures
has_total_ttc            oui/non  TTC total present: Facture and Avoir have it, Confirmation and BL don't
has_tva                  oui/non  TVA mentioned: same split as TTC
has_conditions_paiement  oui/non  Payment terms (30j, comptant): Facture's signature feature, rarely on Avoir
has_ref_commande         oui/non  PO reference: present across types, not discriminating alone
has_signature            oui/non  Signed: common on BL, not discriminating alone
has_date_livraison       oui/non  Delivery date or shipping info: BL's clean separator, 100% accuracy
nb_lignes                int      Line count: Avoir 1-5, Facture 3-15, BL 1-20
has_rib                  oui/non  Bank details: Facture only

Sample rows

{"has_total_ttc":"oui","has_tva":"oui","has_conditions_paiement":"oui","has_ref_commande":"oui","has_signature":"non","has_date_livraison":"non","nb_lignes":7,"has_rib":"oui","label":"Facture"}
{"has_total_ttc":"non","has_tva":"non","has_conditions_paiement":"non","has_ref_commande":"oui","has_signature":"oui","has_date_livraison":"oui","nb_lignes":12,"has_rib":"non","label":"BL"}
{"has_total_ttc":"oui","has_tva":"oui","has_conditions_paiement":"non","has_ref_commande":"oui","has_signature":"non","has_date_livraison":"non","nb_lignes":2,"has_rib":"non","label":"Avoir"}
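
A row in this shape can be checked before training with a small validator. The sketch below assumes only what the feature list and sample rows show (seven oui/non flags, an integer nb_lignes, one of four labels); the function name parse_doc_type_row is hypothetical and not part of the product.

```python
import json

# The seven oui/non features and four labels listed in this section.
BOOLEAN_FEATURES = (
    "has_total_ttc", "has_tva", "has_conditions_paiement", "has_ref_commande",
    "has_signature", "has_date_livraison", "has_rib",
)
LABELS = {"Facture", "Confirmation", "Avoir", "BL"}


def parse_doc_type_row(line):
    """Parse one NDJSON line and check it against the doc_type schema."""
    row = json.loads(line)
    for name in BOOLEAN_FEATURES:
        if row.get(name) not in ("oui", "non"):
            raise ValueError(f"{name} must be 'oui' or 'non', got {row.get(name)!r}")
    nb = row.get("nb_lignes")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(nb, int) or isinstance(nb, bool):
        raise ValueError(f"nb_lignes must be an integer, got {nb!r}")
    if row.get("label") not in LABELS:
        raise ValueError(f"unknown label {row.get('label')!r}")
    return row
```

Each of the three sample rows above passes this check; a row with a missing flag or an unknown label raises ValueError.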

What Snake learned (SAT clauses as business rules)

IF has_date_livraison = oui                                    → BL
IF has_tva = non AND has_date_livraison = non                  → Confirmation
IF has_tva = oui AND has_conditions_paiement = oui             → Facture
IF has_tva = oui AND has_conditions_paiement = non AND nb_lignes ≤ 5  → Avoir

These aren't programmed rules — they're the SAT clauses Snake constructed from data. They read like business rules because the features map to domain concepts.
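
To make the four clauses concrete, here is a hand-written paraphrase of them as a Python function. It is not the Snake model and not how Snake evaluates its clauses: the function name, the top-to-bottom precedence, and the None result for inputs the four clauses don't cover are all assumptions made for this sketch.

```python
def doc_type_from_rules(row):
    """Apply the four listed clauses in order; return None if none matches."""
    if row["has_date_livraison"] == "oui":
        return "BL"
    # From here on, has_date_livraison is "non".
    if row["has_tva"] == "non":
        return "Confirmation"
    # From here on, has_tva is "oui".
    if row["has_conditions_paiement"] == "oui":
        return "Facture"
    if row["nb_lignes"] <= 5:
        return "Avoir"
    return None  # e.g. TVA present, no payment terms, more than 5 lines
```

Applied to the three sample rows, this returns Facture, BL, and Avoir respectively, which matches their labels.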

Impact on the product

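The doc_type prediction appears in the response at document.type.Prediction and directly affects two things: the gate on price checks (only Facture and Avoir lines go through price_anomaly) and the doc_type component of the trust score (0 to 20 points).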

Dataset 2: price_anomaly

Accuracy: 98.6% (140 test samples)
Training data: 700 rows × 5 features × 3 classes

Classifies each invoice line as Normal, Alerte, or Critique.

Features

Feature                Type   Range       What it captures
ecart_pct              float  -60 to +60  Invoice price vs PO/reference price (%)
ecart_historique_pct   float  -40 to +40  Invoice price vs average of recent invoices (%)
nb_lignes_ecart        int    0-5         How many other lines in the same invoice also deviate
montant_ecart          float  0-30K       Total monetary impact: |deviation| × quantity
fournisseur_fiabilite  float  0.40-1.00   Supplier reliability score from ERP
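
Two of these features are plain arithmetic on a line's prices. The sketch below assumes the deviation is measured relative to the reference price, with a positive sign when the invoice is more expensive, and that |deviation| in montant_ecart is the absolute unit-price difference; the function names are hypothetical.

```python
def ecart_pct(invoice_price, ref_price):
    """Signed deviation of the invoice price from the reference price, in %."""
    return (invoice_price - ref_price) / ref_price * 100.0


def montant_ecart(invoice_price, ref_price, quantity):
    """Monetary impact: absolute unit-price deviation times quantity."""
    return abs(invoice_price - ref_price) * quantity
```

For a line invoiced at 21.50 against a reference of 21.00 for 200 units, ecart_pct rounds to 2.4 and montant_ecart is 100.0.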

Sample rows

{"ecart_pct":2.3,"ecart_historique_pct":1.8,"nb_lignes_ecart":0,"montant_ecart":46.0,"fournisseur_fiabilite":0.91,"label":"Normal"}
{"ecart_pct":8.7,"ecart_historique_pct":6.2,"nb_lignes_ecart":1,"montant_ecart":870.0,"fournisseur_fiabilite":0.72,"label":"Alerte"}
{"ecart_pct":-22.5,"ecart_historique_pct":-18.0,"nb_lignes_ecart":3,"montant_ecart":4500.0,"fournisseur_fiabilite":0.52,"label":"Critique"}

The learned decision boundaries

Condition                                            Prediction  Business meaning
|ecart| < 5% AND fiabilite > 0.70                    Normal      Small deviation, trusted supplier: pay
|ecart| 5-15%                                        Alerte      Medium deviation: human review
|ecart| > 15%                                        Critique    Large deviation: block payment
|ecart| < 5% BUT fiabilite < 0.60 AND nb_lignes ≥ 2  Alerte      Small number, but pattern of deviations from an unreliable supplier
|ecart| 3-6% BUT fiabilite > 0.92 AND isolated       Normal      Slightly high, but top supplier with no pattern: trust it

The fiabilite interaction is the key finding: Snake learned that who sent the invoice matters, not just the numbers. A small deviation from an unreliable supplier with multiple deviating lines is more suspicious than a medium deviation from a top supplier. This was never explicitly programmed.
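
The boundaries in the table can be written out as a hand-coded approximation, which makes the supplier interaction easy to see. This is a paraphrase, not the trained model: it ignores ecart_historique_pct and montant_ecart, reads "isolated" as nb_lignes_ecart == 0, checks the two exception rows before the three bands, and returns None where the table says nothing. All of those choices are assumptions of this sketch.

```python
def price_anomaly_from_rules(ecart_pct, fiabilite, nb_lignes_ecart):
    """Approximate the listed boundaries; None means the table is silent."""
    e = abs(ecart_pct)
    # Exception rows first (assumed precedence).
    if 3 <= e <= 6 and fiabilite > 0.92 and nb_lignes_ecart == 0:
        return "Normal"   # top supplier, isolated deviation
    if e < 5 and fiabilite < 0.60 and nb_lignes_ecart >= 2:
        return "Alerte"   # small deviation, unreliable supplier, pattern
    # Then the three bands.
    if e > 15:
        return "Critique"
    if e >= 5:
        return "Alerte"
    if fiabilite > 0.70:
        return "Normal"
    return None           # small deviation, mid reliability: not covered
```

With these rules a 5.5% deviation from a supplier scored 0.95 with no other deviating lines is Normal, while a 4% deviation from a supplier scored 0.55 with two other deviating lines is Alerte.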

Impact on the product

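The price_anomaly prediction appears per line in controle_lignes[].controle_prix.anomalie and drives the price_anomaly component of the trust score (0 to 25 points), the alerts in synthese.alertes, and the final synthese.decision.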

How the datasets interact

Text: "Facture GlassCorp, 200 trempes 10mm a 21.50 EUR, 150 feuilletes a 58 EUR. Normalement 21 et 54."

 1. Extraction     regex finds: 2 products, 2 prices, 2 ref prices, supplier
                   features: has_tva=non, has_total_ttc=non, has_rib=non, ...

 2. doc_type       Snake predicts: Facture? Confirmation? (depends on text keywords)
                   feeds: trust_score.doc_type (+0-20 pts)
                   gates: only Facture/Avoir proceed to price check

 3. price_anomaly  For each line, Snake classifies the deviation:
                   L1: ecart=+2.4%, fiabilite=0.85 → Normal
                   L2: ecart=+7.4%, fiabilite=0.85 → Alerte
                   feeds: trust_score.price_anomaly (+0-25 pts)
                   feeds: synthese.alertes, synthese.decision

 4. Trust score    extraction(17) + doc_type(20) + price(23) + completeness(15) + consistency(15)
                   = 90/100
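
The total in step 4 is a plain sum of the five components. The sketch below reproduces it and checks the two ranges the walkthrough states (doc_type 0 to 20, price_anomaly 0 to 25); the maxima of the other three components are not given here, so they are left unchecked. The names EXAMPLE, DOCUMENTED_MAX, and trust_score are hypothetical.

```python
# Component values from the walkthrough above.
EXAMPLE = {
    "extraction": 17,
    "doc_type": 20,
    "price_anomaly": 23,
    "completeness": 15,
    "consistency": 15,
}

# Only these two ranges are stated in the walkthrough.
DOCUMENTED_MAX = {"doc_type": 20, "price_anomaly": 25}


def trust_score(components):
    """Sum the components after checking the documented ranges."""
    for name, cap in DOCUMENTED_MAX.items():
        if not 0 <= components[name] <= cap:
            raise ValueError(f"{name} must be between 0 and {cap}")
    return sum(components.values())
```

Calling trust_score(EXAMPLE) returns 90, the value shown in step 4.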

The two models run sequentially but are independent: they don't share weights or features. doc_type sees document-level features, mostly oui/non flags (has_tva, has_rib). price_anomaly sees line-level numeric features (ecart_pct, fournisseur_fiabilite). The trust score is the only place where their outputs combine: a weak doc_type classification together with weak price confidence yields low trust, even if each result on its own would look acceptable.

Real-life data potential

doc_type

Source: ERP document management

Volume: ~2,600 docs/year/factory

5 factories: 13,000 labeled samples/year

Labeling: free — the ERP already tags document type on entry

Real data adds OCR noise, multilingual docs, and edge cases (facture proforma, combined documents). Expect accuracy to dip to ~94%, then recover to 97%+ with volume.

price_anomaly

Source: ERP invoice history + PO matching

Volume: ~150 lines/week/factory

5 factories: 39,000 labeled samples/year

Labeling: free — paid = Normal, reviewed = Alerte, disputed = Critique

Real data adds: product-specific thresholds, seasonal pricing, supplier patterns. The 5% boundary becomes product-aware.
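
The free labeling for price_anomaly and the yearly volume can both be expressed in a few lines. The status strings "paid", "reviewed", and "disputed" are placeholders for whatever the ERP actually records, and the 52-week year is an assumption that reproduces the 39,000 figure; both function names are hypothetical.

```python
# Hypothetical ERP status strings mapped to the three labels.
STATUS_TO_LABEL = {
    "paid": "Normal",
    "reviewed": "Alerte",
    "disputed": "Critique",
}


def label_from_status(status):
    """Return the training label for an ERP line status, or None if unknown."""
    return STATUS_TO_LABEL.get(status)


def yearly_lines(lines_per_week, factories, weeks=52):
    """Labeled lines per year across all factories."""
    return lines_per_week * weeks * factories
```

With 150 lines per week and 5 factories, yearly_lines gives 39000.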

Current datasets are synthetic. The architecture, features, and classification logic are production-ready.

Plug in real ERP data → retrain → the same API serves real predictions.

No code changes: just run python3 train_all.py with the new NDJSON files.
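
Preparing those files amounts to writing one JSON object per line. The sketch below shows that serialization and its inverse using only the standard library; the helper names are hypothetical, and the file names and locations train_all.py expects are not specified here, so writing the text to disk is left out.

```python
import json


def to_ndjson(rows):
    """Serialize an iterable of dicts as NDJSON text, one object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def from_ndjson(text):
    """Parse NDJSON text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

A round trip through to_ndjson and from_ndjson returns the original rows, which is a quick sanity check before handing new data to the training script.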

Charles Dana — Monce SAS — 2026