Institutional sources for forensic lexicons — what have you used?

4 Posts
3 Users
0 Reactions
1,464 Views
(@antonsen)
New Member
Joined: 2 months ago
Posts: 2
Topic starter   [#20670]

Building lexicons for multilingual forensic chat analysis, I keep running into the same gap: most off-the-shelf NLP corpora are general-purpose, while the specialised vocabulary of investigation — drug slang, weapon nomenclature, coercive-control patterns, child-exploitation terminology — sits in institutional documents that are often not indexed and harder to find than they should be.

For a few European languages I've been able to anchor lexicons in solid institutional sources, for example:

  • Swedish: Brottsförebyggande rådet (Brå) for drug and violence terminology
  • Dutch: Jellinek straatnamen and Trimbos-instituut for drug-related slang and clinical terms
  • Italian: DCSA's Universo droga glossary for source-cited drug terms (~150 entries)
  • Bulgarian: terminology curated by PIC Stara Zagora
  • Norwegian: Kripos publications
  • English (UK/EU context): Internet Watch Foundation glossaries; Europol IOCTA reports for cross-jurisdictional reference.

What these sources share is the property that matters most for forensic use: a finding can be traced back to a citable institutional source rather than resting on undocumented author judgement. In an evidentiary context, a smaller lexicon with proper citations is more useful than a larger one built only from crowd-sourced terms.

Where I'm thinner and actively looking for input: Romanian, Greek, Czech and Hungarian. I have working lexicons for several of these but they currently rely on academic-paper terms and crowd sources rather than institutional ones — which I'd prefer to anchor properly before relying on them in casework contexts.

Two questions to the community:

  1. For the languages above (or any other under-served European language), are there institutional or government sources for specialised forensic terminology — drug slang, IPV/coercive-control vocabulary, child-exploitation glossaries — that you've found genuinely useful and citable?
  2. More broadly: how do other DFIR practitioners handle the trade-off between lexicon breadth and source citability when working in languages where institutional sources are thin?

Happy to compare notes if anyone is doing similar work in this space.

/Andreas



   
(@torenre)
New Member
Joined: 2 months ago
Posts: 1
 

The evidentiary traceability point is critical, especially in multilingual DFIR. For Romanian and Czech, you might have better luck digging through national police training materials, court glossaries, or NGO documentation tied to trafficking and IPV rather than pure academic corpora. How are you versioning and preserving provenance when slang evolves or meanings shift over time?

 



   
(@ilyacolton)
New Member
Joined: 2 months ago
Posts: 1
 

Are there any institutional or government sources for specialized forensic terminology (e.g., drug slang, IPV/coercive-control vocabulary, child-exploitation glossaries) in the languages mentioned? Recommendations for genuinely useful and citable resources would be valuable.



   
(@antonsen)
New Member
Joined: 2 months ago
Posts: 2
Topic starter  

Thanks both.

@torenre: the police-training / court-glossary / NGO reframe is the right one. I'd been treating "institutional" too narrowly, mostly looking for the government-published equivalent of DCSA's *Universo droga* or Trimbos' Tiplijst. That format exists where the ministerial drug agencies have published it; where they haven't, training docs, trafficking NGO reports, and IPV victim-support material cover the same ground in a different shape. Useful framing; I'll widen the search.

On the **versioning / provenance-over-time** question, three things, in increasing order of how well they actually work:

1. **Git-versioned lexicon files with per-term source URLs and last-reviewed dates in the header.** Each finding the system produces carries the lexicon version it matched against, so a finding from 2025-Q3 can be defended against the lexicon as it existed then, not against today's. That's the evidentiary part.

2. **New entries rather than overwrites when a term shifts.** A shifted euphemism lands as a new entry with its own source; the old entry stays unless explicitly deprecated. Provenance over time follows from this.

3. **Architectural: don't lean on the lexicon alone.** Vocabulary drift is fast in CSAM-distribution and drug forums, so layered detection matters more than current vocabulary: lexical indicators with provenance, contextual co-occurrence, temporal behavioural shifts, cross-conversation persistence. The lexicon layer is the most citable but also the first to degrade under adaptation. The other layers don't depend on the specific vocabulary surviving.
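
To make points 1 and 2 concrete, here is a minimal Python sketch of the idea, not my actual implementation. All class names, field names, the `termx` placeholder, the `example.org` URLs and the version strings are made up for illustration. I've also put the source URL and review date on each entry rather than in a file header, and in practice the version string would be a git tag or commit hash.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class LexiconEntry:
    term: str
    sense: str
    source_url: str                       # citable source for this sense
    last_reviewed: date
    deprecated_on: Optional[date] = None  # set only on explicit deprecation


@dataclass(frozen=True)
class Finding:
    message_id: str
    term: str
    sense: str
    source_url: str
    lexicon_version: str                  # version the match was made against


class Lexicon:
    def __init__(self, version: str, entries: List[LexiconEntry]):
        self.version = version
        self.entries = list(entries)

    def with_new_entry(self, entry: LexiconEntry, new_version: str) -> "Lexicon":
        # A shifted meaning is appended as a new entry; the old one is kept.
        return Lexicon(new_version, self.entries + [entry])

    def match(self, message_id: str, text: str) -> List[Finding]:
        # Naive whitespace tokenisation, enough for the illustration.
        tokens = set(text.lower().split())
        return [
            Finding(message_id, e.term, e.sense, e.source_url, self.version)
            for e in self.entries
            if e.deprecated_on is None and e.term.lower() in tokens
        ]


# Hypothetical data: placeholder term and example.org URLs.
v1 = Lexicon("2025.3", [
    LexiconEntry("termx", "older sense",
                 "https://example.org/glossary-2024", date(2025, 7, 1)),
])
v2 = v1.with_new_entry(
    LexiconEntry("termx", "shifted sense",
                 "https://example.org/report-2026", date(2026, 1, 15)),
    new_version="2026.1",
)

old_findings = v1.match("msg-001", "meet for termx tonight")
new_findings = v2.match("msg-001", "meet for termx tonight")
```

A finding produced under `v1` keeps pointing at version `2025.3` and the source that was cited then, even after `v2` adds the shifted sense alongside it.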

Appreciate the pointers — especially on the NGO/training-material angle.


This post was modified 1 month ago by antonsen

   