Organizing and Curating Image Corpora

with Tropy and Arvest

Julien Rabaud

SCD & Pôle Numérique - UPPA

March 10, 2026

– Agenda

Part 1 — Conceptual Foundations

  1. The Semantic Web
  2. Triples & URIs · RDF
  3. Metadata schemas
  4. Controlled vocabularies

Part 2 — IIIF

  1. What is IIIF?
  2. IIIF in the wild

Part 3 — Tools

  1. Tropy
  2. Arvest

Part 1 – Conceptual Foundations

– Two Webs

The Web we know

Pages written for humans

Search engines index words

Links connect documents

The Semantic Web

Data readable by machines

Links connect things

A global knowledge graph

Note

Tim Berners-Lee, 2001 — the goal: make the Web a universal medium for data exchange, not just document retrieval.

– Triples

Everything in the Semantic Web is a triple:

Subject — Predicate — Object

<Photo_001>   dc:creator             "Julien Rabaud"
<Photo_001>   dcterms:spatial        <geonames:2988507>
<Photo_001>   cidoc:P138_represents  <wd:Q5783414>

Tip

Each resource can be both subject and object — triples form an interconnected graph of knowledge.
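To make the graph idea concrete, here is a minimal sketch in plain Python that stores the triples above as tuples and follows a link from the photo to its place. The `rdfs:label` triple and both helper functions are assumptions added for illustration; a real project would use an RDF library rather than a list.

```python
# Each triple is a (subject, predicate, object) tuple, mirroring the slide.
TRIPLES = [
    ("Photo_001", "dc:creator", "Julien Rabaud"),
    ("Photo_001", "dcterms:spatial", "geonames:2988507"),
    ("Photo_001", "cidoc:P138_represents", "wd:Q5783414"),
    # Hypothetical extra triple: the place is now a subject in its own right.
    ("geonames:2988507", "rdfs:label", "Paris"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object linked to `subject` through `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def place_labels(photo, triples=TRIPLES):
    """Follow photo -> place -> label: two hops through the graph."""
    labels = []
    for place in objects(photo, "dcterms:spatial", triples):
        labels.extend(objects(place, "rdfs:label", triples))
    return labels


if __name__ == "__main__":
    print(place_labels("Photo_001"))
```

The GeoNames identifier is the object of one triple and the subject of another, which is what turns a flat list of statements into a graph.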

– URIs: Naming Things

A URI is a unique, global, persistent identifier for any thing.

https://sws.geonames.org/2988507/          →  Paris
https://www.wikidata.org/entity/Q937       →  Albert Einstein
http://vocab.getty.edu/aat/300046300       →  photographs

Unlike a URL, which only locates a page, a URI names a resource: the same identifier in any language, meant to stay stable over time.

Tip

When you put a GeoNames URI in a Tropy metadata field, you’re not typing a string — you’re linking your data to a global knowledge graph.

– RDF: The Standard for Triples

Turtle — human-readable:

@prefix dc: <http://purl.org/dc/elements/1.1/> .

<https://myarchive.org/photo/001>
  dc:title   "Cloister of S. Domingo" ;
  dc:creator "Julien Rabaud" ;
  dc:date    "2024-09-15" .

JSON-LD — used by Tropy:

{
  "@context": { "dc": "http://purl.org/dc/elements/1.1/" },
  "@id": "https://myarchive.org/photo/001",
  "dc:title": "Cloister of S. Domingo",
  "dc:creator": "Julien Rabaud"
}

Note

Tropy maps every metadata field to an RDF property and exports items as JSON-LD, so your descriptions are already Linked Data, even if you don’t think of them that way.
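The short keys in a JSON-LD record stand for full property URIs. The sketch below shows that expansion for the record above, assuming the `dc` prefix is bound to the Dublin Core elements namespace. `expand_keys` is a hypothetical helper that handles only simple prefixes; a real JSON-LD processor does considerably more.

```python
DOC = {
    "@context": {"dc": "http://purl.org/dc/elements/1.1/"},
    "@id": "https://myarchive.org/photo/001",
    "dc:title": "Cloister of S. Domingo",
    "dc:creator": "Julien Rabaud",
}


def expand_keys(doc):
    """Replace 'prefix:name' keys with full IRIs using the document's @context."""
    context = doc.get("@context", {})
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        prefix, sep, local = key.partition(":")
        # Only rewrite keys whose prefix is declared; '@id' has no colon and is kept.
        if sep and prefix in context:
            key = context[prefix] + local
        expanded[key] = value
    return expanded


if __name__ == "__main__":
    for key, value in expand_keys(DOC).items():
        print(key, "=", value)
```

After expansion, `dc:title` and the Turtle property of the same name resolve to the same URI, which is why the two serializations describe the same triples.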

Metadata Schemas

How do we describe what we see?

Dublin Core

15 universal elements — usable for any resource type.

Property        Usage
dc:title        Title
dc:creator      Photographer / author
dc:contributor  Other contributors
dc:publisher    Publisher / institution
dc:date         Date of creation
dc:subject      Topic / theme
dc:description  Free-text note
dc:type         Resource type
dc:format       JPEG, TIFF…
dc:identifier   Shelfmark / ID
dc:source       Archive, repository
dc:language     Language of visible text
dc:coverage     Place or period
dc:relation     Related items
dc:rights       License / copyright

🔗 dublincore.org · http://purl.org/dc/elements/1.1/

– Dublin Core Terms (DCTERMS)

Richer, typed versions of DC — plus extra properties.

# Basic DC — just a string
dc:date "2024-09-15"

# DCTERMS — typed and linkable
dcterms:created   "2024-09-15"^^xsd:date
dcterms:spatial   <geonames:3117735>
dcterms:license   <https://creativecommons.org/licenses/by/4.0/>
dcterms:isPartOf  <myArchive:collection_42>

Key additions:

  • dcterms:spatial — place URI
  • dcterms:temporal — time period
  • dcterms:license — explicit URI
  • dcterms:isPartOf — collection link

Tip

Rule of thumb: use dc: when a plain string is enough; use dcterms: whenever you want to link to a URI (a place, a license, a collection). Linking to shared URIs is what makes your data interoperable.

– CIDOC-CRM

The international standard for cultural heritage data (Europeana, museums, major archives).

Unlike Dublin Core, CIDOC-CRM models events — not just documents.

Note

Key insight: a photograph wasn’t just taken. It was taken by someone, somewhere, at a moment, of something.

In CIDOC-CRM, everything important is modelled as an event.

– CIDOC-CRM — Core Classes

Class  Meaning
E22    Human-Made Object
E31    Document / Photograph
E39    Actor (person or group)
E52    Time-Span
E53    Place
E65    Creation event

<Creation_001>  cidoc:P14_carried_out_by  <orcid:0000-…>
<Creation_001>  cidoc:P7_took_place_at    <geonames:3117735>
<Creation_001>  cidoc:P4_has_time-span    <2024-09-15>

Tip

In Tropy, you can import the CIDOC-CRM vocabulary and build event-oriented templates from these properties.
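The difference from a flat Dublin Core record can be sketched in a few lines of Python: the "who / where / when" values are attached to a creation event rather than to the photo itself. The function name, the `Creation_…` naming scheme, and the actor identifier are assumptions for illustration; `P94_has_created` is the CIDOC-CRM property linking a creation event to what it created, and does not appear on the slide.

```python
def to_creation_event(photo_id, creator, place, date):
    """Turn a flat 'who / where / when' record into event-centred triples."""
    event = f"Creation_{photo_id}"  # hypothetical naming scheme for the event node
    return [
        (event, "rdf:type", "cidoc:E65_Creation"),
        (event, "cidoc:P94_has_created", photo_id),
        (event, "cidoc:P14_carried_out_by", creator),
        (event, "cidoc:P7_took_place_at", place),
        (event, "cidoc:P4_has_time-span", date),
    ]


if __name__ == "__main__":
    triples = to_creation_event(
        "001",
        "actor:photographer",   # placeholder actor identifier
        "geonames:3117735",
        "2024-09-15",
    )
    for triple in triples:
        print(triple)
```

Every statement has the event as its subject, so a second event (a later restoration, a digitization) can be added without overwriting the first.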

Which Schema for Your Template?

Your sources                 Recommended schema         Level
General archival documents   Dublin Core Terms          Item
Correspondence, diaries      Tropy Correspondence (DC)  Item
Cultural heritage objects    CIDOC-CRM                  Item
Visual works (art history)   VRA Core                   Item
Technical image data         EXIF                       Photo
Crop / detail of interest    DC or custom               Selection

Note

These schemas are not mutually exclusive — a single template can mix properties from several vocabularies.
Example: dc:title + dcterms:spatial + cidoc:P138_represents in the same item template.

Controlled Vocabularies

Consistent, linked values for metadata fields

– The Problem with Free Text

Typing “Paris” yourself:

  • “Paris”
  • “paris”
  • “Paris, France”
  • “Paris (France)”

→ 4 different strings, no shared meaning

Using a URI:

dcterms:spatial
  <geonames:2988507>

→ Always Paris
→ In any language
→ With coordinates
→ Linked to all other data about Paris

Note

Controlled vocabularies = shared dictionaries.
When everyone uses the same URI, datasets become interoperable.
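As a rough illustration of the problem, the sketch below maps the four spellings of Paris to one GeoNames URI. The lookup table and the normalization rules are assumptions made for this example. Real reconciliation also has to disambiguate homonyms, which simple string cleaning cannot do.

```python
import re

# Hypothetical, hand-made lookup table: normalised label -> GeoNames URI.
PLACES = {"paris": "https://sws.geonames.org/2988507/"}


def normalise(label):
    """Lower-case, drop a trailing qualifier, collapse whitespace."""
    label = label.casefold()
    label = re.sub(r"\s*\(.*?\)\s*$", "", label)  # "paris (france)" -> "paris"
    label = label.split(",")[0]                    # "paris, france"  -> "paris"
    return " ".join(label.split())


def reconcile(label):
    """Return the URI for a free-text place name, or None if unknown."""
    return PLACES.get(normalise(label))


if __name__ == "__main__":
    for variant in ["Paris", "paris", "Paris, France", "Paris (France)"]:
        print(repr(variant), "->", reconcile(variant))
```

Four strings collapse to one identifier. Storing the URI in the first place avoids having to write cleaning rules like these afterwards.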

– GeoNames

Over 12 million unique geographic features, each with a stable URI.

https://sws.geonames.org/2988507/  →  Paris, France
https://sws.geonames.org/3117735/  →  Madrid, Spain
https://sws.geonames.org/6440564/  →  Anglet, France

Each entry includes names in 20+ languages, coordinates, administrative hierarchy, and feature type.

Tip

In Tropy: use GeoNames URIs in the dcterms:spatial field.
🔗 geonames.org/search.html

– Pactols

Multilingual thesaurus for Archaeology, Classical and Oriental Studies — maintained by the Frantiq network.

Covers: archaeological periods · object types · materials · ancient places · historical figures

Note

Especially relevant for CHORAL research — Romance cultures, Mediterranean and Iberian heritage, classical antiquity.

🔗 pactols.frantiq.fr

– Getty Vocabularies

Published as Linked Open Data by the Getty Research Institute — the standard reference for art history and cultural heritage.

Vocabulary  Scope
AAT         Art & Architecture Thesaurus: styles, materials, techniques, object types
TGN         Thesaurus of Geographic Names: historical & current places
ULAN        Union List of Artist Names: artists, architects, makers

vocab.getty.edu/aat/300263552   →  oil paintings
vocab.getty.edu/tgn/7011179     →  Salamanca
vocab.getty.edu/ulan/500010570  →  Francisco Goya

🔗 vocab.getty.edu

– Loterre

Linked Open TERminology REsources — published by Inist-CNRS.

A multidisciplinary platform hosting 70+ scientific terminologies as SKOS/RDF Linked Open Data.

Relevant for SSH researchers:

  • Art et Archéologie
  • Ethnologie
  • Histoire et sciences des religions
  • Linguistique · Littérature
  • Géographie de l’Amérique du Nord
  • Pays et subdivisions

All vocabularies are:

  • Free to consult & download
  • Available as SKOS/RDF, JSON-LD, CSV
  • Queryable via SPARQL & REST API
  • FAIR-compliant

🔗 loterre.fr

– Openthéso (Huma-Num instance)

A platform hosting many discipline-specific thesauri — terms expressed in SKOS.

Each term has:

  • a stable ARK identifier (URI)
  • skos:prefLabel — preferred label
  • skos:altLabel — synonyms
  • skos:broader / skos:narrower
  • skos:exactMatch → links to other thesauri

Hosted thesauri include:

  • Architectural heritage (MHFA)
  • Performing arts vocabulary
  • … and many others

🔗 opentheso.huma-num.fr

Part 2 – IIIF

International Image Interoperability Framework

– What is IIIF?

Open standards for rich access to digitized images — developed since 2011 by libraries, archives, and museums worldwide.

Note

The core promise:
Any IIIF-compliant viewer can display any IIIF-compliant image — regardless of where it is hosted.

Deep zoom · cropping · structured collections · shared annotations · multi-institutional exhibitions

– Image API

Standardizes how images are served.

{server}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}

# Full image at 800px wide:
https://example.org/iiif/photo001/full/800,/0/default.jpg

# Just the top-left quarter:
https://example.org/iiif/photo001/pct:0,0,50,50/max/0/default.jpg

→ Deep zoom, cropping, resizing — all server-side, no file duplication.
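Because the URL pattern is fixed, image requests can be assembled programmatically. The helper below is a minimal sketch: the function names and the `example.org` base URL are assumptions, and the default size keyword `max` follows Image API 3.0 (older servers may expect `full` instead).

```python
from urllib.parse import quote


def pct_region(x, y, w, h):
    """Region expressed as percentages of the full image."""
    return f"pct:{x},{y},{w},{h}"


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation=0, quality="default", fmt="jpg"):
    """Assemble {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}."""
    # Identifiers may contain '/' or ':' and must be percent-encoded in the path.
    ident = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


if __name__ == "__main__":
    base = "https://example.org/iiif"
    # Full image, 800 pixels wide.
    print(iiif_image_url(base, "photo001", size="800,"))
    # Top-left quarter at the largest size the server offers.
    print(iiif_image_url(base, "photo001", region=pct_region(0, 0, 50, 50)))
```

A selection drawn in a viewer can be turned into a crop URL the same way, by passing its pixel coordinates as the region (`x,y,w,h`).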

– Presentation API: The Manifest

Standardizes how collections are described.

A Manifest is a JSON-LD document that groups images into a structured object.

Manifest
  └── Canvas  (= one "view" or "page")
       ├── Annotation  [painting]   →  the image
       └── Annotation  [commenting] →  your notes

Tip

A manuscript → 1 Manifest, 200 Canvases.
A photo series → 1 Manifest, 50 Canvases.
An altarpiece → 1 Manifest, front + back + details.
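For readers who want to see the structure as data, here is a minimal sketch that builds a Presentation API 3.0 Manifest as a Python dictionary. The function name, identifier scheme, and `example.org` URLs are assumptions. Note that version 3.0 places an AnnotationPage between each Canvas and its painting Annotation, a level the simplified tree above leaves out.

```python
import json


def build_manifest(base, label, images):
    """images: list of (image_url, width, height) tuples, one per Canvas."""
    canvases = []
    for n, (url, width, height) in enumerate(images, start=1):
        canvas_id = f"{base}/canvas/{n}"
        canvases.append({
            "id": canvas_id,
            "type": "Canvas",
            "width": width,
            "height": height,
            "items": [{
                "id": f"{canvas_id}/page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/painting",
                    "type": "Annotation",
                    "motivation": "painting",  # this annotation IS the image
                    "body": {"id": url, "type": "Image", "format": "image/jpeg",
                             "width": width, "height": height},
                    "target": canvas_id,
                }],
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base}/manifest.json",
        "type": "Manifest",
        "label": {"en": [label]},
        "items": canvases,
    }


if __name__ == "__main__":
    manifest = build_manifest(
        "https://example.org/iiif/letter1",
        "Letter, recto and verso",
        [("https://example.org/img/recto.jpg", 4000, 3000),
         ("https://example.org/img/verso.jpg", 4000, 3000)],
    )
    print(json.dumps(manifest, indent=2))
```

A recto/verso document therefore yields one Manifest with two Canvases, each carrying one painting Annotation.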

IIIF — Europeana

3,000+ institutions aggregated — IIIF access to millions of items.

https://iiif.europeana.eu/presentation/{collection}/{id}/manifest

Note

For CHORAL researchers:
Search for heritage from Spain, Portugal, France, Italy, Romania — manuscripts, photographs, maps, objects — then import the Manifest directly into Arvest.

🔗 europeana.eu

IIIF — Omeka S

Omeka S can auto-generate IIIF Manifests for every item with an image.

Omeka S can also import IIIF Manifests and create items from them.

IIIF — Nakala

Nakala (Huma-Num) is a research data repository that auto-generates IIIF endpoints for all deposited images.

  1. Deposit your images → get a DOI + IIIF URL
  2. Import the Manifest in Arvest
  3. Annotate with collaborators
  4. Export for publication or ML

Note

This is a complete, FAIR-compliant workflow for image corpora in the humanities.

🔗 nakala.fr

Part 3 – Tools for Working with Image Corpora

Tropy

From archival photos to structured, linked research data

– What is Tropy?

A free, open-source desktop app for organizing and describing research photographs.

Not a generic photo manager — built specifically for historians, art historians, and humanists.

Created by:

  • RRCHNM — Roy Rosenzweig Center for History and New Media (George Mason U.)
  • C²DH — Luxembourg Centre for Contemporary and Digital History

First release: 2017 · License: AGPL-3.0

– Tropy is NOT…

  • A photo editor (Photoshop, Lightroom…)

  • A reference manager (that’s Zotero)

  • A writing platform

  • An online publishing platform (that’s Omeka)

✅ The missing link between your camera roll and your structured research data.

– The Core Workflow

  1. Import photos from your archival sessions — including PDFs
  2. Group related photos into items
  3. Describe with rich, linked metadata
  4. Annotate regions of interest
  5. Export as JSON-LD, CSV, or to Omeka S

Tip

One Tropy project per research trip or archive collection.

– The Tropy Interface

4 main panels:

  • Project (left) — lists, tags, saved searches
  • Item grid (center) — browse all items
  • Metadata panel (right) — describe the selected item
  • Viewer — view and annotate photos

Two project modes:

Mode      Behaviour
Standard  Copies files → portable
Advanced  Links to originals → lighter

For archival work: Standard is safer.

Tip

Before your first import, set a default template in Edit → Preferences → Settings; all imported items will use it automatically.

– Three Levels of Description

📁 Item — the primary unit

Groups logically related photos. Example: a document photographed recto/verso = 1 item, 2 photos.

🖼️ Photo — one image file

Metadata: filename, dimensions, date taken.

You rarely describe photos individually — the item is the primary unit of description.

– EXIF: Metadata Already in the File

Some metadata is embedded in the image file itself by the camera or scanner — no description needed.

Tropy extracts automatically:

  • Filename
  • Date & time of capture
  • Dimensions (pixels)
  • File size (bytes)

Available if GPS was on:

  • Latitude / longitude

The key distinction:

Schema      Who writes it?  What kind?
EXIF        The device      Technical
DC / CIDOC  The researcher  Semantic

EXIF describes how the photo was made.
DC describes what it shows.

Tip

In Tropy, EXIF properties belong in a photo-level template — not an item template. This lets you surface technical data (camera model, GPS, resolution) alongside your semantic description.
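One practical detail when reusing EXIF GPS data: cameras record latitude and longitude as degrees, minutes, and seconds plus a hemisphere letter, while gazetteers and maps usually expect signed decimal degrees. The conversion is simple arithmetic; the function name and the sample coordinates below are illustrative only.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus N/S/E/W into signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative in decimal notation.
    return -value if ref in ("S", "W") else value


if __name__ == "__main__":
    lat = dms_to_decimal(48, 51, 24, "N")
    lon = dms_to_decimal(1, 30, 0, "W")
    print(round(lat, 6), lon)
```

The resulting pair can be compared with the coordinates published for a GeoNames entry to check that a photo was taken where the description says it was.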

– Three Levels of Description

🔍 Selection — a cropped region

A seal, a signature, an inscription, a motif.

Has its own title, notes, and tags — linked to pixel coordinates in the image.

– Templates

A template defines which metadata fields appear for an item — each field is a property from a vocabulary.

Built-in:

  • Tropy Generic — Dublin Core
  • Tropy Correspondence — letters
  • Tropy Photo — photo-level metadata

Also available (import):

  • CIDOC-CRM · VRA Core · Schema.org

Custom / imported:

Edit → Preferences → Vocabularies → Import

Any RDF/OWL schema (JSON-LD or Turtle) → then use those properties in your templates.

– Plugins

Import

  • CSV Import
  • Omeka S Import
  • IIIF Import

Export

  • Omeka S Export
  • CSV Export
  • CSL / Zotero Export

🔗 tropy.org

Experimental: tropy-plugin-nakala by Bruno Morandière

– Learning Tropy

English:

📚 docs.tropy.org
▶️ vimeo.com/user104478141
▶️ youtube.com/@tropy
💬 forums.tropy.org

– Hands-On: Tropy

We will explore together:

  1. Create a new project and import a folder of photos
  2. Create an item from multiple photos (recto / verso)
  3. Apply a template and fill in metadata
  4. Create a selection on a detail
  5. Use tags and saved searches
  6. Export as JSON-LD

Beyond Description — Visualizing Your Corpus

Once your images are described and structured, new possibilities open up.

VIKUS Viewer — developed at FH Potsdam’s Urban Complexity Lab — arranges thousands of cultural artifacts on a dynamic canvas, letting you explore thematic and temporal patterns across an entire collection at a glance.

  • Items positioned along a timeline
  • Keywords visualized as an interactive frequency map
  • Zoom into high-resolution textures
  • Runs in the browser — no installation

Note

I won’t go into this further today — but it’s a beautiful example of what a well-described corpus enables.

Once you’ve done the work in Tropy, tools like this become possible.

🔗 vikusviewer.fh-potsdam.de

Arvest

Annotate, collaborate, expose IIIF collections

– What is Arvest?

A web platform for working with IIIF image collections — no installation needed.

  • Import local images or IIIF Manifests
  • Annotate with the W3C Web Annotation standard
  • Collaborate with shared workspaces
  • Expose data via a REST API
  • Export for machine learning pipelines

🔗 arvest.app

– Importing — Local Files

  1. Create a workspace
  2. Import → Upload files
  3. Arvest generates a IIIF Manifest automatically

Supported formats: JPEG, PNG, TIFF, WebP

– Importing — IIIF Manifest

  1. Find a Manifest URL (Europeana, Nakala, Omeka S, Gallica…)
  2. Import → IIIF Manifest URL
  3. Paste → all canvases and metadata are imported

https://gallica.bnf.fr/iiif/ark:/12148/btv1b8452439r/manifest.json

– Annotating

Arvest uses the W3C Web Annotation standard.

Annotation types:

  • Region — box or polygon
  • Point — a specific location
  • Full-canvas — note on the whole image

Each annotation has:

  • a body (text, tag, or URI)
  • a motivation
  • creator + date metadata
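The three parts listed above map directly onto the W3C Web Annotation data model. The sketch below builds a region annotation whose body is a controlled-vocabulary URI. The function name and all `example.org` identifiers are assumptions, and the exact JSON that Arvest stores may differ.

```python
from datetime import datetime, timezone


def make_annotation(anno_id, canvas, xywh, body_uri, creator,
                    motivation="identifying"):
    """Build a W3C Web Annotation targeting a rectangular region of a Canvas."""
    x, y, w, h = xywh
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": motivation,
        "creator": creator,
        "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "body": body_uri,  # here a URI; text or tag bodies are objects instead
        "target": {
            "source": canvas,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",  # pixel box: x, y, width, height
            },
        },
    }


if __name__ == "__main__":
    anno = make_annotation(
        "https://example.org/anno/1",
        "https://example.org/iiif/letter1/canvas/1",
        (120, 80, 300, 200),
        "https://sws.geonames.org/2988507/",
        "https://example.org/users/researcher",
    )
    print(anno["target"]["selector"]["value"])
```

Because the target is a Canvas URI plus a fragment, the same annotation can be displayed by any viewer that understands the standard.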

– Uses in Humanities Research

  • Identify depicted persons, places, objects
  • Transcribe visible text or inscriptions
  • Tag iconographic themes
  • Link regions to controlled vocabulary URIs
  • Compare motifs across a corpus

– Collaborating

Shared workspaces:

  • Invite collaborators by email
  • Roles: viewer · annotator · editor · admin
  • All annotations visible to the team
  • Comment threads · activity log

Tip

For CHORAL:
Invite partners across institutions, annotate the same corpus collaboratively — across national borders.

– Annotation Workflow

  1. Lead researcher imports corpus + sets guidelines
  2. Collaborators annotate independently
  3. Review — conflicts flagged
  4. Consensus — annotations finalized
  5. Export for publication or ML

– Exposing Data via API

GET /api/v1/workspaces/{id}/manifests
GET /api/v1/workspaces/{id}/annotations
GET /api/v1/manifests/{id}/canvas/{n}

→ Your annotated corpus becomes a queryable dataset

Export formats for Machine Learning:

  • COCO — object detection
  • CSV / JSON — text classification
  • IIIF + W3C — ML pipelines
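To show what such an export involves, here is a minimal sketch that converts region annotations into COCO-style bounding boxes. It is not Arvest's exporter: the function names, the lookup dictionaries, and the assumption that each annotation body is a plain label string are all made up for this example, and only pixel-based `xywh` selectors are handled.

```python
def xywh_from_selector(value):
    """Parse 'xywh=x,y,w,h' (optionally 'xywh=pixel:...') into four numbers."""
    coords = value.split("=", 1)[1]
    if coords.startswith("pixel:"):
        coords = coords[len("pixel:"):]
    return [float(n) for n in coords.split(",")]


def to_coco(annotations, image_ids, categories):
    """
    annotations: Web Annotation dicts with a FragmentSelector target
    image_ids:   canvas URI -> integer image id
    categories:  label string -> integer category id
    """
    coco = []
    for n, anno in enumerate(annotations, start=1):
        x, y, w, h = xywh_from_selector(anno["target"]["selector"]["value"])
        coco.append({
            "id": n,
            "image_id": image_ids[anno["target"]["source"]],
            "category_id": categories[anno["body"]],
            "bbox": [x, y, w, h],  # COCO boxes are [x, y, width, height]
            "area": w * h,
            "iscrowd": 0,
        })
    return coco


if __name__ == "__main__":
    canvas = "https://example.org/iiif/letter1/canvas/1"
    sample = [{
        "body": "seal",
        "target": {"source": canvas,
                   "selector": {"type": "FragmentSelector",
                                "value": "xywh=120,80,300,200"}},
    }]
    print(to_coco(sample, {canvas: 1}, {"seal": 1}))
```

Both formats describe a box by its top-left corner plus width and height, so the conversion is mostly a matter of renaming fields and assigning integer ids.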

Note

Annotate once → reuse everywhere.

– The Full Workflow

flowchart LR
    A["📷 Archival<br/>Photos"] --> B["Tropy<br/>Organize & Describe"]
    B -- "JSON-LD" --> C["Omeka S<br/>or Nakala"]
    C -- "IIIF Manifest" --> D["Arvest<br/>Annotate & Collaborate"]
    D -- "API / Export" --> E["Publication<br/>or ML"]
    F["Europeana · Gallica<br/>Other IIIF sources"] --> D
    G["GeoNames · Pactols<br/>Openthéso"] --> B & D

Summary

Key principles:

  • Use URIs instead of free text
  • Choose a metadata schema suited to your sources
  • Link to controlled vocabularies
  • IIIF is the interoperability layer for images

Your next steps:

  1. Install Tropy
  2. Register on Arvest
  3. Find IIIF content on Europeana
  4. Deposit images in Nakala
  5. Create a collaborative workspace

Questions?

Julien Rabaud · UPPA · SCD · Pôle Numérique

📧 julien.rabaud@univ-pau.fr

ujubib.github.io/ed-tropy-arvest/slides.html

References

Semantic Web & RDF: w3.org/TR/rdf11-primer

Dublin Core: dublincore.org

CIDOC-CRM: cidoc-crm.org

GeoNames: geonames.org

Pactols / Openthéso: pactols.frantiq.fr · opentheso.huma-num.fr

IIIF: iiif.io · iiif.io/api/cookbook

Tropy: docs.tropy.org · forums.tropy.org

Nakala: nakala.fr

Arvest: arvest.app

VIKUS Viewer: vikusviewer.fh-potsdam.de