Data Model¶
Overview¶
The data model is designed to build a structured registry of boats and their associated entities.
While regatta results are used as the primary source of information, the goal of the database is not to store detailed race results. Instead, regatta appearances act as a discovery mechanism for identifying boats and extracting structured information about them.
The canonical dataset therefore focuses on boats and their relationships, including owners, clubs, and classes.
The model follows a layered structure:
Raw Data → ETL Pipelines → Canonical Entities
The canonical dataset is exposed through a structured API, enabling exploration and integration with external applications.¶
Raw Data Layer¶
Raw data is populated through ingestion pipelines and serves as the entry point for all downstream transformations.
Raw scraped data is stored in the yacht_raw schema.
The table:
stores the extracted information for each regatta page as a JSON object.
Each record typically includes:
- source metadata
- regatta name
- year
- scraped timestamp
- structured raw data (
jsonb)
This layer preserves the original extracted values and allows all transformations to be reproducible and traceable.
Raw data is never modified or deleted.
Canonical Entity Model¶
The canonical schema (yacht_db) represents the structured dataset used by the system.
The model is centred around a set of core entities.
The canonical dataset is actively used by the API layer to provide structured access to boats and their relationships.
Boats¶
The boats table represents the canonical identity of a boat.
Each boat is defined by:
- boat name
- boat identifier (typically a sail number)
- associated boat class
Boat identity is resolved during the normalisation process using:
- cleaned sail numbers
- fuzzy name matching
- manual review where required
This allows the system to recognise the same boat appearing across multiple regattas even when naming conventions differ.
Owners¶
The owners table stores the canonical identities of boat owners.
A boat may have multiple owners.
The relationship is represented through the junction table:
which allows a many-to-many relationship between boats and owners.
Clubs¶
The clubs table represents sailing clubs associated with boats.
Club records may include additional metadata such as:
- location
- short name
- estimated number of members
Boats may be affiliated with multiple clubs, represented through the junction table:
Boat Classes¶
The boat_classes table represents the general class of a boat, such as:
- Dragon
- Etchells
- J/70
Classes may include metadata such as:
- manufacturer
- rating rule
- crew size
- hull length
Boat Types¶
The boat_type table provides a more specific classification within a class, representing particular boat models or variants.
This allows the system to distinguish between boats that belong to the same class but have different design variants.
Regattas and Editions¶
Regattas are represented by two related tables:
The regattas table represents the event itself, while regatta_editions represents a specific year of that event.
For example:
This separation avoids duplication of event metadata across years.
Boat Participation¶
The table:
records the participation of boats in specific regatta editions.
This information is used primarily as contextual metadata and as a discovery signal for identifying boats and their relationships.
The system does not aim to store full race result analytics at this stage, focusing instead on entity discovery and relationship mapping.
Locations¶
The locations table provides geographic metadata used by both clubs and regattas.
Typical attributes include:
- city
- region
- country
This allows geographic grouping and analysis of clubs and events.
Entity Relationships¶
The model contains several many-to-many relationships:
Boat ↔ Owner Boat ↔ Club Boat ↔ Regatta Edition Edition ↔ Class
These relationships are implemented using junction tables to preserve flexibility and avoid duplication.
Normalisation Workflow¶
Raw values extracted from source pages are rarely consistent.
Normalisation currently follows a hybrid workflow:
- Raw values are extracted from the JSON ingestion layer.
- Lists of unique raw values are generated.
- These values are reviewed and mapped to canonical entities.
- Canonical entities are inserted into the database.
Mapping rules are currently stored in CSV lookup files but will eventually be migrated into database-managed mapping tables.
Normalisation is integrated into the ETL pipeline layer, enabling consistent transformation and insertion into the canonical database.
Data Usage¶
The canonical data model is designed to support:
- API-based exploration of boats and regattas
- relationship analysis between entities
- future frontend applications
- incremental enrichment of the dataset over time
The model prioritises flexibility and extensibility to support future analytical and product layers.
Design Principles¶
The data model follows several key principles:
Entity-centric design The system models boats and related entities rather than race results.
Traceability All canonical entities can be traced back to raw source data.
Flexibility Many-to-many relationships allow complex ownership and affiliation structures.
Incremental normalisation Data quality improves over time through iterative normalisation.
Separation of raw and canonical layers Raw scraped data is stored independently from the canonical relational model.