Project source-tree
*******************

Below is the layout of the project (to 10 levels), followed by the
contents of each key file.

Project directory layout

   licence-normaliser/
   ├── scripts
   │   ├── __init__.py
   │   ├── check_missing_aliases.py
   │   ├── compare_datasets.py
   │   ├── README.rst
   │   └── test_name_inference.py
   ├── src
   │   └── licence_normaliser
   │       ├── cli
   │       │   ├── __init__.py
   │       │   └── _main.py
   │       ├── data
   │       │   ├── aliases
   │       │   │   └── aliases.json
   │       │   ├── prose
   │       │   │   └── prose_patterns.json
   │       │   ├── publishers
   │       │   │   └── publishers.json
   │       │   ├── urls
   │       │   │   └── url_map.json
   │       │   └── README.rst
   │       ├── parsers
   │       │   ├── __init__.py
   │       │   ├── alias.py
   │       │   ├── creativecommons.py
   │       │   ├── opendefinition.py
   │       │   ├── osi.py
   │       │   ├── prose.py
   │       │   ├── publisher.py
   │       │   ├── scancode_licensedb.py
   │       │   └── spdx.py
   │       ├── tests
   │       │   ├── __init__.py
   │       │   ├── conftest.py
   │       │   ├── test_aliases.py
   │       │   ├── test_cache.py
   │       │   ├── test_cli.py
   │       │   ├── test_core.py
   │       │   ├── test_exceptions.py
   │       │   ├── test_integration.py
   │       │   ├── test_models.py
   │       │   ├── test_prose.py
   │       │   └── test_publisher.py
   │       ├── __init__.py
   │       ├── _cache.py
   │       ├── _core.py
   │       ├── _models.py
   │       ├── _normaliser.py
   │       ├── _trace.py
   │       ├── defaults.py
   │       ├── exceptions.py
   │       ├── plugins.py
   │       └── py.typed
   ├── AGENTS.md
   ├── conftest.py
   ├── CONTRIBUTING.rst
   ├── docker-compose.yml
   ├── Dockerfile
   ├── Makefile
   ├── pyproject.toml
   ├── README.rst
   └── tox.ini


README.rst
==========

README.rst

   ==================
   licence-normaliser
   ==================

   .. image:: https://raw.githubusercontent.com/barseghyanartur/licence-normaliser/main/docs/_static/licence_normaliser_logo.webp
      :alt: licence-normaliser logo
      :align: center

   Comprehensive license normalisation with a three-level hierarchy.

   .. image:: https://img.shields.io/pypi/v/licence-normaliser.svg
      :target: https://pypi.python.org/pypi/licence-normaliser
      :alt: PyPI Version

   .. image:: https://img.shields.io/pypi/pyversions/licence-normaliser.svg
      :target: https://pypi.python.org/pypi/licence-normaliser/
      :alt: Supported Python versions

   .. image:: https://github.com/barseghyanartur/licence-normaliser/actions/workflows/test.yml/badge.svg?branch=main
      :target: https://github.com/barseghyanartur/licence-normaliser/actions
      :alt: Build Status

   .. image:: https://readthedocs.org/projects/licence-normaliser/badge/?version=latest
       :target: http://licence-normaliser.readthedocs.io
       :alt: Documentation Status

   .. image:: https://img.shields.io/badge/docs-llms.txt-blue
       :target: https://licence-normaliser.readthedocs.io/en/latest/llms.txt
       :alt: llms.txt - documentation for LLMs

   .. image:: https://img.shields.io/badge/license-MIT-blue.svg
      :target: https://github.com/barseghyanartur/licence-normaliser/#License
      :alt: MIT

   .. image:: https://coveralls.io/repos/github/barseghyanartur/licence-normaliser/badge.svg?branch=main&service=github
       :target: https://coveralls.io/github/barseghyanartur/licence-normaliser?branch=main
       :alt: Coverage

   ``licence-normaliser`` is a comprehensive license normalisation library that
   maps any license representation (SPDX tokens, URLs, prose descriptions) to a
   canonical three-level hierarchy.

   Features
   ========

   - **Three-level hierarchy** - LicenseFamily → LicenseName → LicenseVersion.
   - **Wide format support** - SPDX tokens, URLs, prose descriptions.
   - **Creative Commons support** - Full CC family with versions and IGO variants.
   - **Publisher-specific licenses** - Springer, Nature, Elsevier, Wiley, ACS,
     and more.
   - **File-driven data** - Add aliases, URLs, and patterns by editing JSON files.
     No Python code changes required for new synonyms.
   - **Pluggable parsers** - Drop in a new parser class to ingest
     any external license registry. Parsers implement plugin interfaces
     (``RegistryPlugin``, ``URLPlugin``, etc.).
   - **Strict mode** - Raise ``LicenseNotFoundError`` instead of silently
     returning ``"unknown"``.
   - **Caching** - LRU caching for performance.
   - **CLI** - Command-line interface with ``--strict`` and ``--explain`` support.

   Hierarchy
   =========

   The library uses a three-level hierarchy:

   1. **LicenseFamily** - broad bucket: ``"cc"``, ``"osi"``, ``"copyleft"``,
      ``"publisher-tdm"``, ...
   2. **LicenseName** - version-free: ``"cc-by"``, ``"cc-by-nc-nd"``, ``"mit"``,
      ``"wiley-tdm"``
   3. **LicenseVersion** - fully resolved: ``"cc-by-3.0"``, ``"cc-by-nc-nd-4.0"``

   Installation
   ============

   With ``uv``:

   .. code-block:: sh

       uv pip install licence-normaliser

   Or with ``pip``:

   .. code-block:: sh

       pip install licence-normaliser

   Quick start
   ===========

   .. code-block:: python
       :name: test_quick_start

       from licence_normaliser import normalise_license

       v = normalise_license("CC BY-NC-ND 4.0")
       str(v)                  # "cc-by-nc-nd-4.0"   ← LicenseVersion
       str(v.license)          # "cc-by-nc-nd"       ← LicenseName
       str(v.license.family)   # "cc"                ← LicenseFamily

   Strict mode
   ===========

   By default, unresolvable inputs return an ``"unknown"`` result.  Pass
   ``strict=True`` to raise ``LicenseNotFoundError`` instead:

   .. code-block:: python
       :name: test_strict_mode

       from licence_normaliser import normalise_license
       from licence_normaliser.exceptions import LicenseNotFoundError

       # Silent fallback (default)
       v = normalise_license("some-unknown-string")
       v.family.key  # "unknown"

       # Strict: raises on unresolvable input
       try:
           v = normalise_license("some-unknown-string", strict=True)
       except LicenseNotFoundError as exc:
           print(exc.raw)      # original input
           print(exc.cleaned)  # cleaned form that failed lookup

   Trace / Explain
   ===============

   Set ``ENABLE_LICENCE_NORMALISER_TRACE=1`` or pass ``trace=True`` to get
   resolution traces showing how the license was matched:

   .. code-block:: python
       :name: test_trace

       from licence_normaliser import normalise_license

       # Via function
       v = normalise_license("cc by-nc-nd 3.0 igo", trace=True)
       print(v.explain())

       # Via class
       from licence_normaliser import LicenseNormaliser
       ln = LicenseNormaliser(trace=True)
       v = ln.normalise_license("MIT")
       print(v.explain())

   Output shows the resolution pipeline (alias → registry → url → prose →
   fallback) and which source file + line matched:

   .. code-block:: text

       Input: 'cc by-nc-nd 3.0 igo' → 'cc by-nc-nd 3.0 igo'
         [✓] alias: 'cc by-nc-nd 3.0 igo' → 'cc-by-nc-nd-3.0-igo' (line 139 in aliases.json)

       Result:
         version_key: 'cc-by-nc-nd-3.0-igo'
         name_key: 'cc-by-nc-nd'
         family_key: 'cc'

   The trace can also be accessed via ``v._trace`` for programmatic use.

   Batch normalisation
   ===================

   .. code-block:: python
       :name: test_batch_normalisation

       from licence_normaliser import normalise_licenses

       results = normalise_licenses(["MIT", "Apache-2.0", "CC BY 4.0"])
       for r in results:
           print(r.key)

       # Strict batch - raises on first unresolvable
       results = normalise_licenses(["MIT", "Apache-2.0"], strict=True)

   Custom plugins
   ==============

   The ``LicenseNormaliser`` class lets you inject custom plugin classes for
   specialised use cases:

   .. code-block:: python
       :name: test_custom_plugins

       from licence_normaliser import LicenseNormaliser
       from licence_normaliser.parsers.alias import AliasParser
       from licence_normaliser.parsers.spdx import SPDXParser

       # Use only SPDX + Alias plugins (no CC, no publisher URLs)
       ln = LicenseNormaliser(
           registry=[SPDXParser],
           alias=[AliasParser],
           family=[AliasParser],
           name=[AliasParser],
           cache=True,
           cache_maxsize=8192,
       )

       # MIT resolves via SPDX parser
       assert str(ln.normalise_license("MIT")) == "mit"

       # CC BY-NC-ND resolves via the alias parser
       assert str(ln.normalise_license("CC BY-NC-ND 4.0")) == "cc-by-nc-nd-4.0"

   .. note::

       Explicit plugin passing is optional — ``LicenseNormaliser()``
       automatically loads defaults. Use the pattern above only if you need
       custom plugins or want to reduce the number of plugins loaded.

   For caching, ``LicenseNormaliser`` wraps the resolution method
   with ``lru_cache``.
   Disable it by passing ``cache=False`` for debugging:

   .. code-block:: python
       :name: test_caching

       from licence_normaliser import LicenseNormaliser

       ln = LicenseNormaliser(cache=False)
       result = ln.normalise_license("MIT")

   Update data (CLI)
   =================

   .. code-block:: sh

       licence-normaliser update-data --force
       # Fetches fresh SPDX, OpenDefinition, OSI, CreativeCommons, and ScanCode JSONs

   Integration tests (public API only)
   ===================================

   All integration tests live in
   ``src/licence_normaliser/tests/test_integration.py``
   and only import the public API.

   CLI usage
   =========

   Normalise a single license:

   .. code-block:: sh

       licence-normaliser normalise "MIT"
       # Output: mit

       licence-normaliser normalise --full "CC BY 4.0"
       # Output:
       # Key: cc-by-4.0
       # URL: https://creativecommons.org/licenses/by/4.0/
       # License: cc-by
       # Family: cc

       licence-normaliser normalise --strict "totally-unknown"
       # Exits with code 1 and prints an error

   Batch normalise:

   .. code-block:: sh

       licence-normaliser batch MIT "Apache-2.0" "CC BY 4.0"
       licence-normaliser batch --strict MIT "Apache-2.0"

   Exceptions
   ==========

   .. code-block:: python
       :name: test_exceptions

       from licence_normaliser.exceptions import (
           LicenseNormaliserError,   # base class
           LicenseNotFoundError,     # raised by strict mode
       )

   Testing
   =======

   All tests run inside Docker:

   .. code-block:: sh

       make test

   To test a specific Python version:

   .. code-block:: sh

       make test-env ENV=py312

   License
   =======

   MIT

   Author
   ======

   Artur Barseghyan <artur.barseghyan@gmail.com>


CONTRIBUTING.rst
================

CONTRIBUTING.rst

   ======================
   Contributor guidelines
   ======================

   .. _licence-normaliser: https://github.com/barseghyanartur/licence-normaliser/
   .. _uv: https://docs.astral.sh/uv/
   .. _tox: https://tox.wiki
   .. _ruff: https://beta.ruff.rs/docs/
   .. _doc8: https://doc8.readthedocs.io/
   .. _pre-commit: https://pre-commit.com/#installation
   .. _issues: https://github.com/barseghyanartur/licence-normaliser/issues
   .. _discussions: https://github.com/barseghyanartur/licence-normaliser/discussions
   .. _pull request: https://github.com/barseghyanartur/licence-normaliser/pulls
   .. _versions manifest: https://github.com/actions/python-versions/blob/main/versions-manifest.json

   Developer prerequisites
   -----------------------

   pre-commit
   ~~~~~~~~~~

   Refer to `pre-commit`_ for installation instructions.

   TL;DR:

   .. code-block:: sh

       curl -LsSf https://astral.sh/uv/install.sh | sh  # Install uv
       uv tool install pre-commit                        # Install pre-commit
       pre-commit install                                # Install hooks

   Installing `pre-commit`_ ensures all contributions adhere to the project's
   code quality standards.

   Code standards
   --------------

   `ruff`_ and `doc8`_ are triggered automatically by `pre-commit`_.

   To run checks manually:

   .. code-block:: sh

       make doc8
       make ruff

   Import conventions
   ~~~~~~~~~~~~~~~~~~

   **Import statements belong at module level.** Avoid placing imports inside
   functions or methods unless absolutely necessary:

   - **Acceptable exceptions:**

     - Breaking circular dependencies
     - Optional runtime dependencies (e.g., CLI-only imports)
     - Heavy imports that are rarely used

   - **Why this matters:**

     - Improves code readability
     - Makes dependencies explicit and discoverable
     - Enables static analysis tools to work correctly
     - Follows Python community best practices (PEP 8)

   When in doubt, place imports at the top of the file.

   Virtual environment
   -------------------

   .. code-block:: sh

       make create-venv

   Installation
   ------------

   .. code-block:: sh

       make install

   Testing
   -------

   .. note::
      Python 3.15 is being tested on GitHub CI, but not inside a local Docker image.

   Docker-based testing (recommended)
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   All tests run inside Docker for platform independence and consistency:

   .. code-block:: sh

       make test                    # full matrix (Python 3.10-3.14)
       make test-env ENV=py312      # single Python version
       make shell                   # interactive shell in test container
       make shell-env ENV=py312     # interactive shell for specific Python

   Local testing (alternative)
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~

   For faster iteration during development, you can run tests locally with ``uv``:

   .. code-block:: sh

       make install                 # one-time setup
       uv run pytest                # run all tests
       uv run pytest path/to/test_something.py  # run specific test

   **Important**: If you encounter tooling errors with local testing, fall back to
   Docker-based testing, which is the canonical environment.

   GitHub Actions
   ~~~~~~~~~~~~~~

   In any case, GitHub Actions runs the full matrix automatically on every push.
   Tests run on Python 3.10–3.15 (all non-EOL versions).  See the
   `versions manifest`_ for the full list of available Python versions.

   Adding new normalisation rules
   ------------------------------

   For a new **alias** or **family override** for an *existing* license:

   1. Add an entry to ``src/licence_normaliser/data/aliases/aliases.json``.
   2. Optionally, add an ``aliases`` array to define additional lookup variants
      (e.g. hyphen vs space forms) that resolve to the same target:

      .. code-block:: json

          {
            "cc by-nc": {
              "version_key": "cc-by-nc",
              "name_key": "cc-by-nc",
              "family_key": "cc",
              "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"]
            }
          }

   3. Add a test in ``src/licence_normaliser/tests/test_aliases.py`` or
      ``test_alias_expansion.py``.
   4. No Python changes needed.

   For a new **prose pattern** (regex matching free-text descriptions):

   1. Add an entry to ``src/licence_normaliser/data/prose/prose_patterns.json``.
   2. Add a test in ``src/licence_normaliser/tests/test_prose.py``.
   3. No Python changes needed.

   For a new **URL mapping**:

   1. Add an entry to ``src/licence_normaliser/data/urls/url_map.json`` or
      ``src/licence_normaliser/data/publishers/publishers.json``.
   2. Add a test in ``src/licence_normaliser/tests/test_publisher.py``.
   3. No Python changes needed.

   For a **brand-new license key** (SPDX, OpenDefinition, OSI, CC, or ScanCode):

   1. The upstream data source must be updated first
      (``licence-normaliser update-data --force`` for SPDX/OpenDefinition, or
      edit the upstream source for OSI/CC/ScanCode).
   2. The parser will pick it up automatically on the next import.
   3. Add an alias in ``aliases.json`` if needed.
   4. Add a family override in ``aliases.json`` if needed.
   5. Add tests.

   For a **new parser** (new upstream data source):

   1. Create ``src/licence_normaliser/parsers/my_parser.py`` implementing
      ``BasePlugin``.
   2. Register it in ``src/licence_normaliser/parsers/__init__.py``.
   3. Set ``is_registry_entry = False`` if the parser only contributes
      aliases/URLs/patterns (not new license keys).
   4. Add tests.


   Releases
   --------
   **Build the package for releasing:**

   .. code-block:: sh

       make package-build

   ----

   **Test the built package:**

   .. code-block:: sh

       make check-package-build

   ----

   **Make a test release (test.pypi.org):**

   .. code-block:: sh

       make test-release

   ----

   **Release (pypi.org):**

   .. code-block:: sh

       make release

   Adding tests
   ------------

   - Every new normalisation rule must have a corresponding test.
   - Tests should cover both successful normalisation and edge cases.

   Pull requests
   -------------

   Open a `pull request`_ to the ``dev`` branch only. Never directly to ``main``.

   .. note::

       Create pull requests to the ``dev`` branch only!

   Examples of welcome contributions:

   - Fixing documentation typos or improving explanations.
   - Adding test cases for new edge cases.
   - Extending support for additional license formats.
   - Improving error messages.

   General checklist
   ~~~~~~~~~~~~~~~~~

   - Does your change require documentation updates (``README.rst``,
     ``AGENTS.md``, ``ARCHITECTURE.rst``, ``CONTRIBUTING.rst``)?
   - Does your change require new tests?
   - Does your change add any external dependencies?
     If so, reconsider: ``licence-normaliser`` should have minimal dependencies.

   When fixing bugs
   ~~~~~~~~~~~~~~~~

   - Add a regression test that reproduces the bug before your fix.

   When adding a new feature
   ~~~~~~~~~~~~~~~~~~~~~~~~~

   - Update ``README.rst``, ``AGENTS.md``, and ``ARCHITECTURE.rst`` if applicable.
   - Add appropriate tests.

   Questions
   ---------

   Ask on GitHub `discussions`_.

   Issues
   ------

   Report bugs or request features on GitHub `issues`_.


AGENTS.md
=========

AGENTS.md

   # AGENTS.md - licence-normaliser

   **Repository**: https://github.com/barseghyanartur/licence-normaliser
   **Maintainer**: Artur Barseghyan <artur.barseghyan@gmail.com>

   ---

   ## 1. Project Mission (Never Deviate)

   > Comprehensive license normalisation with a three-level hierarchy - secure,
   > fast, and extensible.

   - Maps any license representation to a canonical three-level hierarchy
   - Supports SPDX tokens, URLs, prose descriptions
   - No external dependencies (only optional dev/test deps)
   - LRU caching for performance
   - Data-file-driven: parsers load from package data JSON files
   - `licence-normaliser update-data` CLI command to refresh SPDX + OpenDefinition data

   ---

   ## 2. Architecture

   ### Three-Level Hierarchy

   | Level | Class | Example |
   | ----- | ----- | ------- |
   | **Family** | `LicenseFamily` | `"cc"`, `"osi"`, `"copyleft"`, `"data"` |
   | **Name** | `LicenseName` | `"cc-by"`, `"mit"`, `"gpl-3.0-only"` |
   | **Version** | `LicenseVersion` | `"cc-by-4.0"`, `"mit"`, `"gpl-3.0-only"` |
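
   As a rough illustration of how the three levels nest, here is a minimal
   sketch using frozen dataclasses. The class and field names (`Family`,
   `Name`, `Version`, `key`) are hypothetical stand-ins chosen for this
   example, not the actual definitions in `_models.py`.

   ```python
   from dataclasses import dataclass


   @dataclass(frozen=True)
   class Family:
       key: str  # broad bucket, e.g. "cc"


   @dataclass(frozen=True)
   class Name:
       key: str  # version-free, e.g. "cc-by"
       family: Family


   @dataclass(frozen=True)
   class Version:
       key: str  # fully resolved, e.g. "cc-by-4.0"
       license: Name

       @property
       def family(self) -> Family:
           # Convenience shortcut so callers can skip one hop.
           return self.license.family


   v = Version("cc-by-4.0", Name("cc-by", Family("cc")))
   print(v.key, v.license.key, v.family.key)
   ```

   Freezing the dataclasses makes instances hashable and immutable, which is
   what lets results be shared safely out of a cache.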

   ### Resolution Pipeline

   1. **Alias table** - cleaned lowercase key matches `ALIASES` (loaded from `data/aliases/aliases.json`)
   2. **Direct registry lookup** - hit in `REGISTRY` (SPDX, OpenDefinition, OSI, CC, ScanCode license keys)
   3. **URL map** - hit in `URL_MAP` (loaded from SPDX + OpenDefinition + publisher data)
   4. **Prose pattern scan** - regex patterns from `data/prose/prose_patterns.json` (for strings >20 chars)
   5. **Fallback** - key = cleaned string, family = unknown
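
   The five stages above can be sketched as a chain of lookups that stops at
   the first hit. This is an illustration only: the table contents, the
   `resolve` function, and the cleaning and URL-trimming steps are assumptions
   made for the example, not the code in `_normaliser.py`.

   ```python
   import re

   # Hypothetical tables; the real ones are built from the JSON data files.
   ALIASES = {"cc by 4.0": "cc-by-4.0"}
   REGISTRY = {"mit": "mit"}
   URL_MAP = {"creativecommons.org/licenses/by/4.0": "cc-by-4.0"}
   PROSE = [(re.compile(r"creative commons attribution 4\.0"), "cc-by-4.0")]


   def resolve(raw: str) -> tuple[str, str]:
       """Return (key, stage) for the first stage that matches."""
       # Assumed cleaning step: lowercase and collapse whitespace.
       cleaned = " ".join(raw.lower().split())
       if cleaned in ALIASES:
           return ALIASES[cleaned], "alias"
       if cleaned in REGISTRY:
           return REGISTRY[cleaned], "registry"
       # Assumed URL normalisation: drop the scheme and trailing slash.
       url = re.sub(r"^https?://", "", cleaned).rstrip("/")
       if url in URL_MAP:
           return URL_MAP[url], "url"
       if len(cleaned) > 20:  # prose scan only for longer strings
           for pattern, key in PROSE:
               if pattern.search(cleaned):
                   return key, "prose"
       # Nothing matched: the cleaned string becomes the key.
       return cleaned, "fallback"


   print(resolve("CC BY 4.0"))
   print(resolve("something odd"))
   ```

   Ordering matters in this design: curated aliases are checked first so they
   can override whatever an upstream registry would otherwise return.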

   ### Key Files

   | File | Purpose |
   | ---- | ------- |
   | `src/licence_normaliser/_models.py` | Frozen dataclass hierarchy |
   | `src/licence_normaliser/_normaliser.py` | `LicenseNormaliser` class with plugin-based resolution |
   | `src/licence_normaliser/plugins.py` | Plugin interfaces (BasePlugin, RegistryPlugin, URLPlugin, etc.) |
   | `src/licence_normaliser/defaults.py` | Lazy-loading default plugin bundle |
   | `src/licence_normaliser/_cache.py` | Module-level API delegating to `LicenseNormaliser` |
   | `src/licence_normaliser/parsers/` | Parser classes implementing plugin interfaces |
   | `src/licence_normaliser/cli/_main.py` | CLI with normalise, batch, update-data |
   | `src/licence_normaliser/exceptions.py` | `LicenseNormaliserError`, `LicenseNotFoundError` |
   | `src/licence_normaliser/data/spdx/spdx.json` | **DO NOT MODIFY** Full SPDX license list (loaded at runtime) |
   | `src/licence_normaliser/data/opendefinition/opendefinition.json` | **DO NOT MODIFY** Full OpenDefinition list (loaded at runtime) |
   | `src/licence_normaliser/data/aliases/aliases.json` | Curated aliases with rich metadata |
   | `src/licence_normaliser/data/prose/prose_patterns.json` | Curated prose regex patterns |
   | `src/licence_normaliser/data/publishers/publishers.json` | Publisher URLs and shorthand aliases |

   ---

   ## 3. Using licence-normaliser in Application Code

   ### Simple case

   ```python name=test_simple_case
   from licence_normaliser import normalise_license

   v = normalise_license("MIT")
   str(v)  # "mit"
   ```

   ### With full hierarchy

   <!-- continue: test_simple_case -->
   ```python name=test_full_hierarchy
   v = normalise_license("CC BY-NC-ND 4.0")
   print(v.key)           # "cc-by-nc-nd-4.0"
   print(v.license.key)   # "cc-by-nc-nd"
   print(v.family.key)    # "cc"
   ```

   ### Strict mode

   ```python name=test_strict_mode
   import pytest
   from licence_normaliser import normalise_license, LicenseNotFoundError

   # Would normally raise: License not found: 'unknown string'
   with pytest.raises(LicenseNotFoundError):
       v = normalise_license("unknown string", strict=True)

   # Batch strict
   from licence_normaliser import normalise_licenses

   with pytest.raises(LicenseNotFoundError):
       results = normalise_licenses(
           ["unknown string", "unknown string 2.0"],
           strict=True,
       )
   ```

   ### Custom plugins with LicenseNormaliser

   The `LicenseNormaliser` class lets you inject custom plugin classes for
   specialised use cases:

   ```python name=test_custom_plugins
   from licence_normaliser import LicenseNormaliser
   from licence_normaliser.parsers.spdx import SPDXParser
   from licence_normaliser.parsers.alias import AliasParser

   # Use only SPDX + Alias plugins (no CC, no publisher URLs)
   ln = LicenseNormaliser(
       registry=[SPDXParser],
       alias=[AliasParser],
       family=[AliasParser],
       name=[AliasParser],
   )

   # MIT resolves via SPDX parser
   assert str(ln.normalise_license("MIT")) == "mit"

   # CC BY-NC-ND resolves via the alias parser
   assert str(ln.normalise_license("CC BY-NC-ND 4.0")) == "cc-by-nc-nd-4.0"
   ```

   To use all defaults, import from `defaults`:

   ```python name=test_defaults_usage
   from licence_normaliser import LicenseNormaliser
   from licence_normaliser.defaults import (
       get_default_registry,
       get_default_url,
       get_default_alias,
       get_default_family,
       get_default_name,
       get_default_prose,
   )

   ln = LicenseNormaliser(
       registry=get_default_registry(),
       url=get_default_url(),
       alias=get_default_alias(),
       family=get_default_family(),
       name=get_default_name(),
       prose=get_default_prose(),
       cache=True,
       cache_maxsize=8192,
   )
   result = ln.normalise_license("MIT")
   ```

   > [!NOTE]
   > Explicit plugin passing is optional — `LicenseNormaliser()` automatically
   > loads defaults. Use the pattern above only if you need custom plugins.

   For caching, `LicenseNormaliser` wraps the resolution method with `lru_cache`.
   Disable it by passing `cache=False` for debugging:

   ```python name=test_caching
   from licence_normaliser import LicenseNormaliser

   ln = LicenseNormaliser(cache=False)
   result = ln.normalise_license("MIT")
   ```
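
   To picture what the `cache` and `cache_maxsize` options do, here is a
   minimal sketch of per-instance LRU caching. `TinyNormaliser` and its
   `_resolve` method are hypothetical and exist only for this example; only
   `functools.lru_cache` is a real API, and the library's own wiring may
   differ.

   ```python
   from functools import lru_cache


   class TinyNormaliser:
       """Hypothetical stand-in that shows per-instance LRU caching."""

       def __init__(self, cache: bool = True, cache_maxsize: int = 8192) -> None:
           self.calls = 0  # counts resolutions that were not served from cache
           if cache:
               # Wrap the bound method so each instance owns its cache.
               self._resolve = lru_cache(maxsize=cache_maxsize)(self._resolve)

       def _resolve(self, raw: str) -> str:
           self.calls += 1
           return raw.strip().lower()

       def normalise(self, raw: str) -> str:
           return self._resolve(raw)


   cached = TinyNormaliser()
   cached.normalise("MIT")
   cached.normalise("MIT")  # second call is a cache hit

   uncached = TinyNormaliser(cache=False)
   uncached.normalise("MIT")
   uncached.normalise("MIT")  # resolved again from scratch

   print(cached.calls, uncached.calls)
   ```

   With caching off, every call walks the full resolution path, which is why
   `cache=False` is convenient when stepping through a lookup in a debugger.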

   ---

   ## 4. Updating Data Sources

   SPDX and OpenDefinition data can be updated via the CLI:

   ```sh
   licence-normaliser update-data --force
   ```

   This fetches fresh JSON from the authoritative upstream URLs and writes them to:
   - `src/licence_normaliser/data/spdx/spdx.json`
   - `src/licence_normaliser/data/opendefinition/opendefinition.json`

   ---

   ## 4a. Trace / Explain

   When debugging why a license resolves a certain way, or aligning curated
   data sources, use the trace feature:

   **Via CLI:**

   ```sh
   licence-normaliser normalise "MIT" --trace
   licence-normaliser normalise "CC BY-NC-ND 3.0 igo" --trace
   licence-normaliser batch MIT Apache --trace
   ```

   Or via environment variable:
   ```sh
   ENABLE_LICENCE_NORMALISER_TRACE=1 licence-normaliser normalise "MIT"
   ```

   **Via Python:**

   ```python name=test_trace
   from licence_normaliser import normalise_license
   v = normalise_license("MIT", trace=True)
   print(v.explain())
   ```

   The trace shows:
   - Each resolution stage attempted (alias → registry → url → prose → fallback)
   - Whether it matched (✓) or didn't (-)
   - Source file and line number for curated sources (aliases.json, publishers.json, prose_patterns.json)
   - Final result with version_key, name_key, family_key

   This is essential for:
   - Understanding why a license resolves unexpectedly
   - Finding the source line that defines an alias when curating data
   - Debugging resolution order issues

   ---

   ## 5. Adding a New Parser

   Parsers implement plugin interfaces and can be added to `src/licence_normaliser/parsers/`:

   1. Create `src/licence_normaliser/parsers/my_parser.py` implementing one or more plugin interfaces:

   ```python name=test_adding_new_parser
   from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin

   class MyParser(BasePlugin, RegistryPlugin, URLPlugin):
       url = None  # or upstream URL for refresh
       local_path = "data/my_parser/my_data.json"

       def load_registry(self) -> dict[str, str]:
           # Return {"license_key": "license_key", ...}
           return {}

       def load_urls(self) -> dict[str, str]:
           # Return {"https://...": "license_key", ...}
           return {}
   ```

   2. Register it in `src/licence_normaliser/defaults.py`:

   <!-- continue: test_adding_new_parser -->
   ```python name=test_adding_new_parser_register
   from licence_normaliser.parsers.spdx import SPDXParser

   def _load_registry_plugins() -> list[type]:
       # ... other imports
       return [
           SPDXParser,
           # ... other plugins
           MyParser,
       ]
   ```

   **Key attribute**: Set `url = None` on parsers that only contribute local data (no refresh capability).

   ---

   ## 6. Coding Conventions

   - Line length: **88 characters** (ruff)
   - Every non-test module must have `__all__`, `__author__`, `__copyright__`, `__license__`
   - Always chain exceptions: `raise X(...) from exc`
   - Type annotations on all public functions
   - Target: `py310`
   - Import statements: Avoid imports inside functions/methods unless absolutely
     necessary (e.g., breaking circular dependencies or optional runtime
     dependencies). Lazy imports harm readability and make dependencies unclear.
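
   A short sketch of the exception-chaining convention follows.
   `DataFileError` and `parse_data` are hypothetical names used only for
   illustration; they are not part of the package.

   ```python
   import json


   class DataFileError(Exception):
       """Hypothetical error type, used only for this illustration."""


   def parse_data(text: str) -> dict:
       try:
           return json.loads(text)
       except json.JSONDecodeError as exc:
           # "from exc" keeps the original error available as __cause__.
           raise DataFileError("invalid JSON data") from exc


   try:
       parse_data("{")
   except DataFileError as err:
       print(type(err.__cause__).__name__)
   ```

   Chaining keeps the low-level parse error in the traceback, so the reader
   sees both the domain-level failure and its root cause.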

   Run linting: `make ruff` or `make pre-commit`

   ---

   ## 7. Agent Workflow: Adding Features or Fixing Bugs

   1. **Check the mission** - does the change preserve the no-dependencies policy and three-level hierarchy?
   2. **Identify the correct location**:
      - New SPDX/OD license → update SPDX/OpenDefinition JSON files (run `update-data`)
      - New alias or family override → add to `data/aliases/aliases.json`
      - **Use `--trace` to find the exact line that defines an alias**
      - New URL mapping → add to `data/publishers/publishers.json`
      - New prose pattern → add to `data/prose/prose_patterns.json`
      - New parser → `parsers/my_parser.py` + `defaults.py`
      - Core pipeline change → `_normaliser.py` or `_cache.py`
   3. **Write tests** covering both success and error cases
   4. **Update README.rst** if the API changed
   5. **Suggest running**: `make test-env ENV=py312` then `make test`
   6. **Suggest running**: `make pre-commit`

   ---

   ## 8. Testing Rules

   > [!NOTE]
   > Python 3.15 is being tested on GitHub CI, but not inside a local Docker image.

   ### Docker-based testing (recommended)

   All tests run inside Docker for platform independence and consistency:

   ```sh
   make test                   # full matrix (Python 3.10-3.14)
   make test-env ENV=py312     # single version
   make shell                  # interactive shell in test container
   ```

   ### Local testing (alternative)

   For faster iteration during development, you can run tests locally with `uv`:

   ```sh
   make install                # one-time setup
   uv run pytest               # run all tests
   uv run pytest path/to/test_something.py  # run specific test
   ```

   **Important**: If you encounter tooling errors with local testing, fall back to Docker-based testing, which is the canonical environment.

   ### Test layout

   ```text
   src/licence_normaliser/tests/
       test_integration.py     - public API only (survives any rewrite)
       test_core.py            - end-to-end pipeline tests
       test_exceptions.py      - exception hierarchy and strict mode
       test_cli.py             - CLI commands including update-data
       test_models.py          - LicenseFamily, LicenseName, LicenseVersion
       test_aliases.py         - non-CC aliases (Apache, MIT, BSD, GPL, etc.)
       test_alias_expansion.py - explicit aliases array expansion feature
       test_publisher.py       - publisher URLs and shorthand aliases
       test_prose.py           - prose pattern matching
   ```

   ### Documentation snippet conventions

   Code blocks in this file use two special attributes to support chained
   executable tests:

   - `name=<test_name>` — labels a snippet so it can be referenced later.
   - `<!-- continue: <test_name> -->` placed immediately before a code block
     means that block **continues** the named snippet; all names, imports,
     and variables defined in the named block are already in scope and must
     **not** be re-imported or re-declared in the continuation block.

   Example:

   ```python name=test_my_example
   class Foo:
       pass
   ```

   <!-- continue: test_my_example -->
   ```python name=test_my_example_continued
   foo = Foo()  # Foo is in scope from the named block above
   assert isinstance(foo, Foo)
   ```

   ---

   ## 9. Forbidden

   - Adding external dependencies
   - Removing existing normalisation coverage
   - Changing the three-level hierarchy structure
   - Modifying the following files is strictly forbidden:

     - `src/licence_normaliser/data/creativecommons/creativecommons.json`
     - `src/licence_normaliser/data/opendefinition/opendefinition.json`
     - `src/licence_normaliser/data/osi/osi.json`
     - `src/licence_normaliser/data/scancode_licensedb/scancode_licensedb.json`
     - `src/licence_normaliser/data/spdx/spdx.json`

     Use `licence-normaliser update-data --force` to refresh them from upstream
     sources.


conftest.py
===========

conftest.py

   """Pytest fixtures for documentation testing."""

   from typing import Any as AnyType

   import pytest


   @pytest.fixture()
   def Any() -> AnyType:  # noqa
       """For use in documentation."""
       return AnyType


docker-compose.yml
==================

docker-compose.yml

   services:
     tox:
       build: .
       volumes:
         - ./htmlcov:/app/htmlcov


pyproject.toml
==============

pyproject.toml

   [project]
   name = "licence-normaliser"
   description = "Comprehensive license normalisation with a three-level hierarchy."
   readme = "README.rst"
   version = "0.3.2"
   requires-python = ">=3.10"
   dependencies = []
   authors = [
       { name = "Artur Barseghyan", email = "artur.barseghyan@gmail.com" },
   ]
   maintainers = [
       { name = "Artur Barseghyan", email = "artur.barseghyan@gmail.com" },
   ]
   license = "MIT"
   classifiers = [
       "Development Status :: 4 - Beta",
       "Intended Audience :: Developers",
       "Operating System :: OS Independent",
       "Programming Language :: Python :: 3.10",
       "Programming Language :: Python :: 3.11",
       "Programming Language :: Python :: 3.12",
       "Programming Language :: Python :: 3.13",
       "Programming Language :: Python :: 3.14",
       "Programming Language :: Python :: 3.15",
       "Programming Language :: Python",
       "Topic :: Software Development :: Libraries :: Python Modules",
   ]
   keywords = [
       "license",
       "normalisation",
       "spdx",
       "creative commons",
       "open source",
   ]

   [project.scripts]
   licence-normaliser = "licence_normaliser.cli:main"

   [project.urls]
   Homepage = "https://github.com/barseghyanartur/licence-normaliser/"
   Repository = "https://github.com/barseghyanartur/licence-normaliser/"
   Issues = "https://github.com/barseghyanartur/licence-normaliser/issues"

   [project.optional-dependencies]
   all = ["licence-normaliser[dev,test,docs,build]"]
   dev = [
       "detect-secrets",
       "doc8",
       "ipython",
       "mypy",
       "ruff",
       "uv",
   ]
   test = [
       "pytest",
       "pytest-cov",
       "pytest-codeblock",
   ]
   docs = [
       "sphinx",
       "sphinx-autobuild",
       "sphinx-rtd-theme>=1.3.0",
       "sphinx-no-pragma",
       "sphinx-markdown-builder",
       "sphinx-llms-txt-link",
       "sphinx-source-tree",
   ]
   build = [
       "build",
       "twine",
       "wheel",
   ]

   [tool.setuptools]
   package-dir = {"" = "src"}

   [tool.setuptools.packages.find]
   where = ["src"]
   include = ["licence_normaliser", "licence_normaliser.*"]

   [tool.setuptools.package-data]
   "licence_normaliser" = ["data/**/*.json"]

   [build-system]
   requires = ["setuptools>=41.0", "wheel"]
   build-backend = "setuptools.build_meta"

   [tool.ruff]
   line-length = 88
   lint.select = [
       "B",
       "C4",
       "E",
       "F",
       "G",
       "I",
       "ISC",
       "INP",
       "N",
       "PERF",
       "Q",
       "SIM",
   ]
   lint.ignore = [
       "G004",
       "ISC003",
   ]
   fix = true
   src = ["src/licence_normaliser"]
   exclude = [
       ".bzr",
       ".direnv",
       ".eggs",
       ".git",
       ".hg",
       ".mypy_cache",
       ".nox",
       ".pants.d",
       ".ruff_cache",
       ".svn",
       ".tox",
       ".venv",
       "__pypackages__",
       "_build",
       "buck-out",
       "build",
       "dist",
       "node_modules",
       "venv",
       "docs",
   ]
   target-version = "py310"
   lint.dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"

   [tool.ruff.lint.isort]
   known-first-party = ["licence_normaliser"]

   [tool.ruff.lint.per-file-ignores]
   "conftest.py" = [
       "PERF203"
   ]

   [tool.doc8]
   ignore-path = [
       "docs/requirements.txt",
       "src/licence_normaliser.egg-info/SOURCES.txt",
   ]

   [tool.pytest.ini_options]
   addopts = [
       "-ra",
       "-vvv",
       "-q",
       "--cov=licence_normaliser",
       "--ignore=.tox",
       "--cov-report=html",
       "--cov-report=term",
       "--cov-append",
       "--capture=no",
   ]
   testpaths = [
       "src/licence_normaliser/tests",
       ".",
       "**/*.rst",
       "**/*.md",
   ]
   pythonpath = ["src"]
   norecursedirs = [".git", ".tox"]

   [tool.coverage.run]
   relative_files = true
   omit = [".tox/*"]
   source = ["licence_normaliser"]

   [tool.coverage.report]
   show_missing = true
   exclude_lines = [
       "pragma: no cover",
       "@overload",
   ]

   [tool.mypy]
   check_untyped_defs = true
   warn_unused_ignores = true
   warn_redundant_casts = true
   warn_unused_configs = true
   ignore_missing_imports = true

   [tool.sphinx-source-tree]
   ignore = [
       "*.egg-info",
       "*.py,cover",
       "*.pyc",
       "*.pyo",
       ".DS_Store",
       ".coverage",
       ".coverage.*",
       ".git",
       ".hg",
       ".hypothesis",
       ".idea",
       ".mypy_cache",
       ".nox",
       ".pre-commit-config.yaml",
       ".pre-commit-hooks.yaml",
       ".pytest_cache",
       ".readthedocs.yaml",
       ".ruff_cache",
       ".secrets.baseline",
       ".svn",
       ".tox",
       ".venv",
       ".vscode",
       "CHANGELOG.rst",
       "CODE_OF_CONDUCT.rst",
       "LICENSE",
       "SECURITY.rst",
       "Thumbs.db",
       "__pycache__",
       "build",
       "codebin",
       "dist",
       "docs/Makefile",
       "docs/_build",
       "docs/_static",
       "docs/changelog.rst",
       "docs/code_of_conduct.rst",
       "docs/make.bat",
       "docs/requirements.txt",
       "docs/security.rst",
       "docs/source_tree.rst",
       "docs/source_tree_full.rst",
       "env",
       "htmlcov",
       "node_modules",
       "venv",
       "ARCHITECTURE.rst",
       ".coderabbit.yaml",
       ".coveralls",
       "docs/full-llms.rst",
       "docs/llms.rst",
       "docs/contributor_guidelines.rst",
       "docs/package.rst",
       "docs/documentation.rst",
       "docs/index.rst",
       "uv.lock",
       "codebin",
       "src/licence_normaliser/data/creativecommons",
       "src/licence_normaliser/data/opendefinition",
       "src/licence_normaliser/data/osi",
       "src/licence_normaliser/data/scancode_licensedb",
       "src/licence_normaliser/data/spdx",
   ]
   order = [
       "README.rst",
       "CONTRIBUTING.rst",
       "AGENTS.md",
   ]

   [[tool.sphinx-source-tree.files]]
   output = "docs/full_llms.rst"
   title = "Full project source-tree"

   [[tool.sphinx-source-tree.files]]
   output = "docs/llms.rst"
   title = "Project source-tree"
   ignore = [
       "*.egg-info",
       "*.py,cover",
       "*.pyc",
       "*.pyo",
       ".DS_Store",
       ".coverage",
       ".coverage.*",
       ".git",
       ".hg",
       ".hypothesis",
       ".idea",
       ".mypy_cache",
       ".nox",
       ".pre-commit-config.yaml",
       ".pre-commit-hooks.yaml",
       ".pytest_cache",
       ".readthedocs.yaml",
       ".ruff_cache",
       ".secrets.baseline",
       ".svn",
       ".tox",
       ".venv",
       ".vscode",
       "CHANGELOG.rst",
       "CODE_OF_CONDUCT.rst",
       "LICENSE",
       "SECURITY.rst",
       "Thumbs.db",
       "__pycache__",
       "build",
       "codebin",
       "dist",
       "docs/Makefile",
       "docs/_build",
       "docs/_static",
       "docs/changelog.rst",
       "docs/code_of_conduct.rst",
       "docs/make.bat",
       "docs/requirements.txt",
       "docs/security.rst",
       "docs/source_tree.rst",
       "docs/source_tree_full.rst",
       "env",
       "htmlcov",
       "node_modules",
       "venv",
       "examples",
       "docs",
       "ARCHITECTURE.rst",
       ".coderabbit.yaml",
       ".coveralls",
       "docs/full-llms.rst",
       "docs/llms.rst",
       "docs/contributor_guidelines.rst",
       "docs/package.rst",
       "docs/documentation.rst",
       "docs/index.rst",
       "uv.lock",
       "src/licence_normaliser/data/creativecommons",
       "src/licence_normaliser/data/opendefinition",
       "src/licence_normaliser/data/osi",
       "src/licence_normaliser/data/scancode_licensedb",
       "src/licence_normaliser/data/spdx",
   ]


scripts/README.rst
==================

scripts/README.rst

   Scripts
   =======

   Sort aliases
   ------------

   Sorts ``aliases.json`` keys alphabetically. Comment keys (starting with
   ``_``) are preserved at the top in their original order. All other entries
   are sorted case-insensitively.

   .. code-block:: sh

       uv run python scripts/sort_aliases.py
       uv run python scripts/sort_aliases.py --check  # exit 1 if not sorted
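
   The ordering rule above can be sketched as follows. This is an illustration
   only, not the code of ``sort_aliases.py``: the helper name
   ``sort_alias_keys`` is hypothetical, and using ``str.casefold`` for the
   case-insensitive comparison is an assumption.

   .. code-block:: python

       def sort_alias_keys(data: dict) -> dict:
           """Return a copy with ``_`` comment keys first, then sorted entries.

           Hypothetical helper; the real script may differ in its details.
           """
           # Comment keys keep their original relative order (dicts preserve
           # insertion order).
           comments = {k: v for k, v in data.items() if k.startswith("_")}
           # All other keys are sorted case-insensitively (assumed: casefold).
           entry_keys = sorted(
               (k for k in data if not k.startswith("_")), key=str.casefold
           )
           return {**comments, **{k: data[k] for k in entry_keys}}


       sample = {"MIT": 1, "_comment": "x", "apache-2.0": 2, "_note": "y", "BSD": 3}
       print(list(sort_alias_keys(sample)))

   A ``--check`` mode could then compare ``list(data)`` with
   ``list(sort_alias_keys(data))`` and exit with status 1 when they differ.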

   Find alias duplicates
   ---------------------

   Finds duplicate ``version_key`` entries in ``aliases.json``. A "duplicate"
   is when two or more top-level primary keys share the same ``version_key``.
   Reports groups with more than one member.

   Can optionally fix duplicates by merging them into the ``aliases`` list of
   a single canonical entry.

   .. code-block:: sh

       uv run python scripts/find_alias_duplicates.py
       uv run python scripts/find_alias_duplicates.py --fix      # interactive fix
       uv run python scripts/find_alias_duplicates.py --noinput  # auto-apply safe fixes
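
   The duplicate definition above amounts to grouping primary keys by their
   ``version_key``. A minimal sketch, assuming each non-comment entry is a
   mapping with a ``version_key`` field; the helper name
   ``find_duplicate_groups`` is hypothetical and this is not the code of
   ``find_alias_duplicates.py``:

   .. code-block:: python

       from collections import defaultdict


       def find_duplicate_groups(data: dict) -> dict[str, list[str]]:
           """Map each shared ``version_key`` to the primary keys that use it."""
           groups: dict[str, list[str]] = defaultdict(list)
           for key, meta in data.items():
               # Skip ``_`` comment keys and anything that is not an entry.
               if key.startswith("_") or not isinstance(meta, dict):
                   continue
               version_key = meta.get("version_key", "")
               if version_key:
                   groups[version_key].append(key)
           # Only groups with more than one member count as duplicates.
           return {vk: keys for vk, keys in groups.items() if len(keys) > 1}


       sample = {
           "_comment": "illustrative data, not taken from aliases.json",
           "mit": {"version_key": "mit"},
           "mit license": {"version_key": "mit"},
           "isc": {"version_key": "isc"},
       }
       print(find_duplicate_groups(sample))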

   Apply aliases patch
   -------------------

   Applies curated additions to ``aliases.json``. Adds an ``aliases`` list to
   existing CC version-free entries and adds new top-level entries for GPL
   shorthand keys that currently fall through to the unknown fallback.

   .. code-block:: sh

       uv run python scripts/apply_aliases_patch.py

   Compare datasets
   ----------------

   Compares SPDX, OpenDefinition, OSI, CreativeCommons, ScanCode, and curated
   data files (aliases, url_map, prose, publishers).

   .. code-block:: sh

       uv run python scripts/compare_datasets.py

   Check missing aliases
   ---------------------

   Reports which licenses loaded from the refreshable plugins (data downloaded
   from upstream sources) have no corresponding ``version_key`` entry in the
   curated ``aliases.json`` file, with coverage figures per source.

   .. code-block:: sh

       uv run python scripts/check_missing_aliases.py
       uv run python scripts/check_missing_aliases.py --json  # JSON output


   Test name inference
   -------------------

   Assesses the accuracy of heuristic name stripping against the curated
   ``name_key`` values from ``aliases.json``. Shows how well automatic name
   extraction works for different license families (CC, copyleft, OSI, etc.).

   .. code-block:: sh

       uv run python scripts/test_name_inference.py
       uv run python scripts/test_name_inference.py --json  # JSON output
       uv run python scripts/test_name_inference.py --json --incorrect-only  # Mismatches only
       uv run python scripts/test_name_inference.py --json --details  # Per-license details


scripts/__init__.py
===================

scripts/__init__.py


scripts/check_missing_aliases.py
================================

scripts/check_missing_aliases.py

   """Check which downloaded licenses are missing from curated aliases.

   Compares all refreshable plugin registries against aliases.json to identify
   licenses that have no corresponding curated alias entry.

   Usage:
       uv run python scripts/check_missing_aliases.py
       uv run python scripts/check_missing_aliases.py --json
   """

   from __future__ import annotations

   import contextlib
   import json
   import sys
   from pathlib import Path

   DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data"
   SCRIPTS_DIR = Path(__file__).parent


   def load_alias_targets() -> set[str]:
       """Load all version_keys from aliases.json."""
       with open(DATA_DIR / "aliases" / "aliases.json") as f:
           data = json.load(f)
       targets: set[str] = set()
       for meta in data.values():
           if isinstance(meta, dict):
               vk = meta.get("version_key", "")
               if vk:
                   targets.add(vk)
       return targets


   def load_downloaded_licenses() -> dict[str, set[str]]:
       """Load licenses from all refreshable plugins."""
       from licence_normaliser.defaults import get_all_refreshable_plugins

       result: dict[str, set[str]] = {}
       for plugin_cls in get_all_refreshable_plugins():
           # Try to load registry
           data = None
           with contextlib.suppress(Exception):
               data = plugin_cls().load_registry()

           if data:
               result[plugin_cls.__name__] = set(data.keys())

       return result


   def check_coverage() -> dict:
       """Check which downloaded licenses have alias entries."""
       alias_targets = load_alias_targets()
       downloaded = load_downloaded_licenses()

       all_downloaded: set[str] = set()
       for licenses in downloaded.values():
           all_downloaded.update(licenses)

       # Categorize
       with_alias = all_downloaded & alias_targets
       without_alias = all_downloaded - alias_targets

       return {
           "total_downloaded": len(all_downloaded),
           "total_alias_targets": len(alias_targets),
           "with_alias": sorted(with_alias),
           "without_alias": sorted(without_alias),
           "coverage_percent": round(len(with_alias) / len(all_downloaded) * 100, 1)
           if all_downloaded
           else 0,
           "by_source": {
               name: {
                   "total": len(licenses),
                   "with_alias": len(licenses & alias_targets),
                   "without_alias": sorted(licenses - alias_targets),
                   "coverage": round(
                       len(licenses & alias_targets) / len(licenses) * 100, 1
                   )
                   if licenses
                   else 0,
               }
               for name, licenses in downloaded.items()
           },
       }


   def group_by_prefix(licenses: list[str]) -> dict[str, list[str]]:
       """Group licenses by common prefixes."""
       groups: dict[str, list[str]] = {}
       prefixes = [
           "gpl-",
           "agpl-",
           "lgpl-",
           "apache-",
           "mpl-",
           "mit",
           "bsd",
           "cc-",
           "unlicense",
           "zlib",
           "isc",
       ]
       for prefix in prefixes:
           matches = sorted([lic for lic in licenses if lic.startswith(prefix)])
           if matches:
               groups[prefix.rstrip("-") or "mit"] = matches
               licenses = [lic for lic in licenses if not lic.startswith(prefix)]

       if licenses:
           groups["other"] = sorted(licenses)

       return groups


   def print_report(data: dict) -> None:
       """Print text table report."""
       print("=" * 70)
       print("Coverage Report: Downloaded Licenses vs Curated Aliases")
       print("=" * 70)
       print()
       print(f"Total downloaded: {data['total_downloaded']}")
       print(f"Total alias targets: {data['total_alias_targets']}")
       print(f"Coverage: {data['coverage_percent']}%")
       print()

       print("-" * 70)
       print("By Source:")
       print("-" * 70)
       print(f"{'Source':<30} {'Total':>8} {'With':>8} {'Without':>8} {'Coverage':>10}")
       print("-" * 70)

       for source, stats in data["by_source"].items():
           print(
               f"{source:<30} {stats['total']:>8} "
               f"{stats['with_alias']:>8} {len(stats['without_alias']):>8} "
               f"{stats['coverage']:>9.1f}%"
           )

       print()
       print("=" * 70)
       print(f"Missing Aliases ({len(data['without_alias'])} licenses)")
       print("=" * 70)

       groups = group_by_prefix(data["without_alias"].copy())

       for group_name, licenses in groups.items():
           if group_name == "other":
               print()
               print(f"All other licenses ({len(licenses)}):")
           else:
               print()
               print(f"{group_name.upper()} ({len(licenses)}):")

           for lic in licenses:
               print(f"  {lic}")

       print()


   def main() -> None:
       json_export = "--json" in sys.argv
       data = check_coverage()

       if json_export:
           print(json.dumps(data, indent=2))
       else:
           print_report(data)


   if __name__ == "__main__":
       main()


scripts/compare_datasets.py
===========================

scripts/compare_datasets.py

   """Dataset comparison tool for licence-normaliser.

   Compares SPDX, OpenDefinition, OSI, CreativeCommons, ScanCode, and
   curated data files (aliases, url_map, prose, publishers) for:
     - Dataset sizes
     - Cross-dataset overlaps
     - Licenses present in OSI but missing from SPDX
     - Orphan alias/URL targets (don't resolve to REGISTRY entries)
     - REGISTRY entries without curated aliases
     - Most-aliased license targets
   """

   from __future__ import annotations

   __all__ = ()

   import json
   from collections import Counter
   from pathlib import Path

   DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data"


   def load_spdx_ids() -> set[str]:
       with open(DATA_DIR / "spdx" / "spdx.json") as f:
           data = json.load(f)
       return {entry["licenseId"] for entry in data["licenses"]}


   def load_od_ids() -> set[str]:
       with open(DATA_DIR / "opendefinition" / "opendefinition.json") as f:
           data = json.load(f)
       return set(data.keys())


   def load_osi_ids() -> set[str]:
       with open(DATA_DIR / "osi" / "osi.json") as f:
           data = json.load(f)
       return {entry["spdx_id"].strip() for entry in data if entry.get("spdx_id")}


   def load_cc_ids() -> set[str]:
       with open(DATA_DIR / "creativecommons" / "creativecommons.json") as f:
           data = json.load(f)
       return {entry["license_key"] for entry in data}


   def load_sc_ids() -> set[str]:
       with open(DATA_DIR / "scancode_licensedb" / "scancode_licensedb.json") as f:
           data = json.load(f)
       return {entry["license_key"] for entry in data}


   def load_alias_keys() -> set[str]:
       with open(DATA_DIR / "aliases" / "aliases.json") as f:
           data = json.load(f)
       return {k for k in data if not k.startswith("_")}


   def load_alias_targets() -> dict[str, str]:
       with open(DATA_DIR / "aliases" / "aliases.json") as f:
           data = json.load(f)
       return {
           k: v.get("version_key", "") for k, v in data.items() if not k.startswith("_")
       }


   def load_url_keys() -> set[str]:
       with open(DATA_DIR / "urls" / "url_map.json") as f:
           data = json.load(f)
       return {k for k in data if not k.startswith("_")}


   def load_url_targets() -> dict[str, str]:
       with open(DATA_DIR / "urls" / "url_map.json") as f:
           data = json.load(f)
       return {
           k: v.get("version_key", "") for k, v in data.items() if not k.startswith("_")
       }


   def load_prose_targets() -> list[str]:
       with open(DATA_DIR / "prose" / "prose_patterns.json") as f:
           data = json.load(f)
       return [entry.get("version_key", "") for entry in data]


   def load_pub_urls() -> set[str]:
       with open(DATA_DIR / "publishers" / "publishers.json") as f:
           data = json.load(f)
       return set(data.get("urls", {}).keys())


   def load_pub_aliases() -> dict[str, str]:
       with open(DATA_DIR / "publishers" / "publishers.json") as f:
           data = json.load(f)
       return dict(data.get("shorthand_aliases", {}))


   def load_registry_keys() -> set[str]:
       from licence_normaliser._cache import get_registry_keys

       return get_registry_keys()


   def load_merged_aliases() -> dict[str, str]:
       """Simulate merged ALIASES: alias_key -> version_key from all curated sources."""
       merged: dict[str, str] = {}
       merged.update(load_alias_targets())
       merged.update(load_pub_aliases())
       for k, v in load_url_targets().items():
           if k not in merged:
               merged[k] = v
       return merged


   def would_resolve(alias_key: str, registry: set[str], aliases: dict[str, str]) -> bool:
       """Simulate _resolve() pipeline for orphan detection.

       1. If already in REGISTRY, covered.
       2. If in ALIASES, get version_key - resolves regardless of registry presence.
       """
       if alias_key in registry:
           return True
       version_key = aliases.get(alias_key, "")
       return bool(version_key)


   def section(title: str) -> None:
       print(f"\n{'=' * 60}")
       print(f"  {title}")
       print(f"{'=' * 60}")


   def main() -> None:
       print("Loading datasets...")
       spdx = load_spdx_ids()
       od = load_od_ids()
       osi = load_osi_ids()
       cc = load_cc_ids()
       sc = load_sc_ids()
       alias_keys = load_alias_keys()
       alias_tgt = load_alias_targets()
       url_keys = load_url_keys()
       url_tgt = load_url_targets()
       prose_tgt = load_prose_targets()
       pub_urls = load_pub_urls()
       pub_aliases = load_pub_aliases()
       registry = load_registry_keys()
       merged_aliases = load_merged_aliases()

       # --- 1. Dataset sizes ---
       section("Dataset Sizes")
       print(f"  SPDX licenses:          {len(spdx):>6}")
       print(f"  OpenDefinition entries: {len(od):>6}")
       print(f"  OSI-approved (SPDX):    {len(osi):>6}")
       print(f"  CreativeCommons:        {len(cc):>6}")
       print(f"  ScanCode DB entries:    {len(sc):>6}")
       print(f"  Aliases (curated):      {len(alias_keys):>6}")
       print(f"  URL mappings (curated): {len(url_keys):>6}")
       print(f"  Prose patterns:         {len(prose_tgt):>6}")
       print(f"  Publisher URLs:         {len(pub_urls):>6}")
       print(f"  Publisher aliases:      {len(pub_aliases):>6}")
       print(f"  REGISTRY entries:       {len(registry):>6}")

       # --- 2. Overlaps ---
       section("Cross-Dataset Overlaps")

       # Format an overlap count as a percentage of the reference dataset size
       def pct(sub: int, total: int) -> str:
           return f"{100 * sub / max(total, 1):.1f}%"

       overlaps = [
           ("SPDX n OSI", len(spdx & osi), len(osi), "OSI"),
           ("SPDX n OD", len(spdx & od), len(od), "OD"),
           ("SPDX n CC", len(spdx & cc), len(cc), "CC"),
           ("OSI n OD", len(osi & od), len(od), "OD"),
           ("OSI n CC", len(osi & cc), len(cc), "CC"),
           ("OD  n CC", len(od & cc), len(cc), "CC"),
           ("ScanCode n SPDX", len(sc & spdx), len(sc), "ScanCode"),
           ("ScanCode n OSI", len(sc & osi), len(sc), "ScanCode"),
       ]
       for label, overlap_count, total_count, pct_label in overlaps:
           ratio = pct(overlap_count, total_count)
           print(f"  {label:<17} {overlap_count:>5}  ({ratio} of {pct_label})")

       # Unique content
       print(f"\n  Unique to SPDX:  {len(spdx - od - osi - cc - sc):>6}")
       print(f"  Unique to OD:    {len(od - spdx):>6}")
       print(f"  Unique to OSI:   {len(osi - spdx):>6}  (OSI IDs not in SPDX)")
       print(f"  Unique to CC:    {len(cc - spdx - od):>6}")
       print(f"  Unique to ScanCode: {len(sc - spdx - osi - od - cc):>6}")

       # --- 3. OSI licenses not in SPDX (reference integrity) ---
       section("OSI Licenses Missing from SPDX")
       osi_only = sorted(osi - spdx)
       if osi_only:
           print(f"  {len(osi_only)} OSI-approved IDs have no SPDX entry:")
           for lid in osi_only[:20]:
               print(f"    {lid}")
           if len(osi_only) > 20:
               print(f"    ... and {len(osi_only) - 20} more")
       else:
           print("  All OSI IDs are present in SPDX.")

       # --- 4. Curated targets not in REGISTRY ---
       section("Curated Targets Missing from REGISTRY")
       orphan_alias = sorted(
           k for k in alias_keys if not would_resolve(k, registry, merged_aliases)
       )
       orphan_url = sorted(
           k for k in url_keys if not would_resolve(k, registry, merged_aliases)
       )
       orphan_pub = sorted(
           k for k in pub_aliases if not would_resolve(k, registry, merged_aliases)
       )
       if orphan_alias:
           print(f"  Alias keys that fail resolution ({len(orphan_alias)}):")
           for k in orphan_alias[:10]:
               print(f"    {k!r}  ->  {alias_tgt.get(k, '')!r}")
           if len(orphan_alias) > 10:
               print(f"    ... and {len(orphan_alias) - 10} more")
       else:
           print("  All alias keys resolve to REGISTRY entries.")
       if orphan_url:
           print(f"\n  URL keys that fail resolution ({len(orphan_url)}):")
           for k in orphan_url[:10]:
               print(f"    {k[:60]!r}  ->  {url_tgt.get(k, '')!r}")
           if len(orphan_url) > 10:
               print(f"    ... and {len(orphan_url) - 10} more")
       if orphan_pub:
           print(f"\n  Publisher aliases that fail resolution ({len(orphan_pub)}):")
           for k in orphan_pub[:10]:
               print(f"    {k!r}  ->  {pub_aliases[k]!r}")
           if len(orphan_pub) > 10:
               print(f"    ... and {len(orphan_pub) - 10} more")
       print(
           "\n  (Note: prose pattern version_keys are often bare name_keys like "
           "'cc-by'; these resolve via the prose pipeline and are not orphans.)"
       )

       # --- 5. REGISTRY entries not covered by curated data ---
       section("REGISTRY Entries Without Curated Mapping")
       covered = (
           set(alias_tgt.values()) | set(url_tgt.values()) | set(pub_aliases.values())
       )
       uncovered = sorted(k for k in registry if k not in covered)
       if uncovered:
           print(f"  {len(uncovered)} REGISTRY keys have no curated alias/URL mapping:")
           for k in uncovered[:20]:
               print(f"    {k}")
           if len(uncovered) > 20:
               print(f"    ... and {len(uncovered) - 20} more")
       else:
           print("  All REGISTRY entries have at least one curated mapping.")

       # --- 6. Duplicate alias keys (same key -> different targets) ---
       section("Duplicate Keys in Alias / URL Data Files")
       # Check if any key maps to different targets across aliases + url_map
       # (keys are unique within each file, so cross-file check)
       cross_keys = alias_keys & url_keys
       if cross_keys:
           print(f"  Keys in both aliases.json AND url_map.json ({len(cross_keys)}):")
           for k in sorted(cross_keys):
               print(f"    {k!r}: aliases={alias_tgt[k]!r}, url_map={url_tgt[k]!r}")

       # --- 7. Alias target frequency (which targets have the most aliases) ---
       section("Most-Aliased License Targets")
       alias_counts = Counter(alias_tgt.values())
       url_counts = Counter(url_tgt.values())
       pub_counts = Counter(pub_aliases.values())
       combined = alias_counts + url_counts + pub_counts
       for target, count in combined.most_common(15):
           parts = []
           if alias_counts[target]:
               parts.append(f"alias={alias_counts[target]}")
           if url_counts[target]:
               parts.append(f"url={url_counts[target]}")
           if pub_counts[target]:
               parts.append(f"pub={pub_counts[target]}")
           print(f"  {target:<30}  total={count:<4}  ({', '.join(parts)})")

       # --- 8. Summary ---
       section("Summary")
       distinct = len(spdx | od | osi | cc | sc)
       orphans = len(orphan_alias) + len(orphan_url) + len(orphan_pub)
       print(f"  Distinct license IDs:           {distinct}")
       print(f"  Curated alias entries:          {len(alias_keys)}")
       print(f"  Curated URL mappings:           {len(url_keys)}")
       print(f"  Orphan curated targets:         {orphans}")
       print(f"  OSI IDs missing SPDX:           {len(osi_only)}")
       covered_count = len(registry) - len(uncovered)
       print(f"  REGISTRY entries covered:       {covered_count}/{len(registry)}")


   if __name__ == "__main__":
       main()


scripts/test_name_inference.py
==============================

scripts/test_name_inference.py

   """Test name inference accuracy against curated aliases.

   Compares heuristic name stripping against curated name_key values from
   aliases.json to assess how well automatic name extraction works.

   Usage:
       uv run python scripts/test_name_inference.py
       uv run python scripts/test_name_inference.py --json
       uv run python scripts/test_name_inference.py --json --incorrect-only
       uv run python scripts/test_name_inference.py --json --details
   """

   from __future__ import annotations

   import json
   import sys
   from pathlib import Path

   from licence_normaliser import LicenseNormaliser

   DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data"
   SCRIPTS_DIR = Path(__file__).parent

   _normaliser = LicenseNormaliser()


   def load_name_mappings() -> dict[str, str]:
       """Load version_key -> name_key mappings from aliases.json."""
       with open(DATA_DIR / "aliases" / "aliases.json") as f:
           data = json.load(f)
       mappings: dict[str, str] = {}
       for meta in data.values():
           if isinstance(meta, dict):
               vk = meta.get("version_key", "")
               nk = meta.get("name_key", "")
               if vk and nk:
                   mappings[vk] = nk
       return mappings


   def infer_name_heuristic(version_key: str) -> str:
       """Delegate to the core LicenseNormaliser's _infer_name method."""
       return _normaliser._infer_name(version_key)


   def categorize_by_family(mappings: dict[str, str]) -> dict[str, dict[str, str]]:
       """Categorize licenses by inferred family."""
       categories: dict[str, dict[str, str]] = {
           "cc": {},  # Creative Commons
           "copyleft": {},  # GPL/AGPL/LGPL
           "osi": {},  # OSI-approved
           "other": {},
       }

       for vk, nk in mappings.items():
           if vk.startswith("cc-"):
               categories["cc"][vk] = nk
           elif vk.startswith(("gpl-", "agpl-", "lgpl-")):
               categories["copyleft"][vk] = nk
           elif vk.startswith(
               ("mpl-", "apache-", "bsd-", "mit", "isc", "unlicense", "zlib")
           ):
               categories["osi"][vk] = nk
           else:
               categories["other"][vk] = nk

       return categories


   def assess_accuracy() -> dict:
       """Assess name inference accuracy."""
       mappings = load_name_mappings()
       categories = categorize_by_family(mappings)

       results: dict = {
           "total_mappings": len(mappings),
           "by_family": {},
       }

       for family, family_mappings in categories.items():
           correct = 0
           incorrect = 0
           details: list[dict] = []

           for vk, curated_nk in family_mappings.items():
               inferred = infer_name_heuristic(vk)
               is_match = inferred == curated_nk
               if is_match:
                   correct += 1
               else:
                   incorrect += 1

               details.append(
                   {
                       "version_key": vk,
                       "curated_name": curated_nk,
                       "inferred_name": inferred,
                       "match": is_match,
                   }
               )

           accuracy = (
               round(correct / len(family_mappings) * 100, 1) if family_mappings else 0
           )

           results["by_family"][family] = {
               "total": len(family_mappings),
               "correct": correct,
               "incorrect": incorrect,
               "accuracy_percent": accuracy,
               "details": details,
           }

       # Overall accuracy
       all_correct = sum(r["correct"] for r in results["by_family"].values())
       all_total = sum(r["total"] for r in results["by_family"].values())
       results["overall_accuracy"] = (
           round(all_correct / all_total * 100, 1) if all_total else 0
       )

       return results


   def print_report(data: dict) -> None:
       """Print text table report."""
       print("=" * 70)
       print("Name Inference Accuracy Report")
       print("=" * 70)
       print()
       print(f"Total curated mappings: {data['total_mappings']}")
       print(f"Overall accuracy: {data['overall_accuracy']}%")
       print()

       print("-" * 70)
       print("By Family:")
       print("-" * 70)
       print(
           f"{'Family':<15} {'Total':>8} {'Correct':>8} {'Incorrect':>8} {'Accuracy':>10}"
       )
       print("-" * 70)

       for family, stats in data["by_family"].items():
           print(
               f"{family:<15} {stats['total']:>8} {stats['correct']:>8} "
               f"{stats['incorrect']:>8} {stats['accuracy_percent']:>9.1f}%"
           )

       print()

       # Show some incorrect examples
       for family, stats in data["by_family"].items():
           if stats["incorrect"] > 0:
               print("-" * 70)
               print(f"Incorrect in {family}: {stats['incorrect']} cases")
               print("-" * 70)
               print(
                   f"{'Version Key':<30} {'Curated (aliases.json)':<25} "
                   f"{'Inferred (heuristic)':<20}"
               )
               print("-" * 70)
               incorrect_details = [d for d in stats["details"] if not d["match"]]
               for detail in incorrect_details[:10]:
                   print(
                       f"{detail['version_key']:<30} "
                       f"{detail['curated_name']:<25} {detail['inferred_name']:<20}"
                   )
               if len(incorrect_details) > 10:
                   print(f"... and {len(incorrect_details) - 10} more")
               print()


   def main() -> None:
       json_export = "--json" in sys.argv
       incorrect_only = "--incorrect-only" in sys.argv
       include_details = "--details" in sys.argv
       data = assess_accuracy()

       if json_export:
           for family in data["by_family"]:
               details = data["by_family"][family].get("details", [])
               if incorrect_only:
                   data["by_family"][family]["details"] = [
                       d for d in details if not d["match"]
                   ]
               elif not include_details:
                   data["by_family"][family].pop("details", None)
           print(json.dumps(data, indent=2))
       else:
           print_report(data)


   if __name__ == "__main__":
       main()
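
The per-family accuracy figure in the script above is the share of curated mappings that a heuristic reproduces exactly, rounded to one decimal. A minimal sketch of that arithmetic follows; ``accuracy_percent``, ``strip_version`` and the three mappings are made up for illustration and are not the script's real heuristic or the contents of ``aliases.json``.

```python
def accuracy_percent(mappings: dict[str, str], infer) -> float:
    """Share of keys whose inferred name equals the curated one, in percent."""
    if not mappings:
        return 0.0
    correct = sum(1 for key, curated in mappings.items() if infer(key) == curated)
    return round(correct / len(mappings) * 100, 1)


def strip_version(key: str) -> str:
    """Toy heuristic (an assumption): cut the key at its first numeric part."""
    parts = key.split("-")
    for i, part in enumerate(parts):
        if part.replace(".", "").isdigit():
            return "-".join(parts[:i])
    return key


# Made-up curated mappings: version_key -> expected name_key.
curated = {
    "cc-by-4.0": "cc-by",
    "cc-by-nc-3.0-igo": "cc-by-nc-igo",  # the toy heuristic yields "cc-by-nc"
    "gpl-3.0": "gpl",
}

print(accuracy_percent(curated, strip_version))
```

Two of the three toy keys match, so the sketch reports two thirds as a percentage. An empty mapping is reported as ``0.0`` rather than raising ``ZeroDivisionError``, mirroring the guard in the script.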


src/licence_normaliser/__init__.py
==================================

src/licence_normaliser/__init__.py

   """licence_normaliser - License normalisation with a three-level hierarchy."""

   from ._core import (
       LicenseFamily,
       LicenseName,
       LicenseVersion,
       normalise_license,
       normalise_licenses,
   )
   from ._normaliser import LicenseNormaliser
   from ._trace import LicenseTrace, LicenseTraceStage
   from .exceptions import LicenseNormalisationError, LicenseNotFoundError

   __title__ = "licence-normaliser"
   __version__ = "0.3.2"
   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"

   __all__ = (
       "LicenseFamily",
       "LicenseName",
       "LicenseVersion",
       "LicenseNormaliser",
       "LicenseNormalisationError",
       "LicenseNotFoundError",
       "LicenseTrace",
       "LicenseTraceStage",
       "normalise_license",
       "normalise_licenses",
   )


src/licence_normaliser/_cache.py
================================

src/licence_normaliser/_cache.py

   """Caching layer + strict mode - delegates to LicenseNormaliser with defaults."""

   from __future__ import annotations

   from threading import Lock
   from typing import Iterable

   from ._models import LicenseVersion
   from ._normaliser import LicenseNormaliser

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = (
       "_default",
       "get_registry_keys",
       "normalise_license",
       "normalise_licenses",
   )


   class _DefaultNormaliser:
       """Thread-safe lazy singleton for the default LicenseNormaliser instance."""

       _instance: LicenseNormaliser | None = None
       _lock: Lock = Lock()

       def get(self) -> LicenseNormaliser:
           if _DefaultNormaliser._instance is None:
               with _DefaultNormaliser._lock:
                   if _DefaultNormaliser._instance is None:
                       _DefaultNormaliser._instance = LicenseNormaliser()
           return _DefaultNormaliser._instance


   _default = _DefaultNormaliser()


   def normalise_license(
       raw: str, *, strict: bool = False, trace: bool | None = None
   ) -> LicenseVersion:
       """Public API with optional strict mode and trace."""
       return _default.get().normalise_license(raw, strict=strict, trace=trace)


   def normalise_licenses(
       raws: Iterable[str], *, strict: bool = False, trace: bool | None = None
   ) -> list[LicenseVersion]:
       """Batch version with optional trace."""
       return _default.get().normalise_licenses(raws, strict=strict, trace=trace)


   def get_registry_keys() -> set[str]:
       """Return the set of all known registry keys from the runtime normaliser."""
       return _default.get().registry_keys()
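
``_DefaultNormaliser.get()`` above uses double-checked locking: the lock is taken only while the instance is still missing, and the check is repeated inside the lock. A minimal standalone sketch of the same pattern follows; ``LazySingleton`` and ``Expensive`` are hypothetical names, and unlike the original this sketch keeps the instance on the holder object rather than on the class.

```python
import threading


class LazySingleton:
    """Build one shared instance on first use, safely across threads."""

    def __init__(self, factory):
        self._factory = factory
        self._instance = None
        self._lock = threading.Lock()

    def get(self):
        # Fast path: skip the lock once the instance exists.
        if self._instance is None:
            with self._lock:
                # Re-check: another thread may have built it while we waited.
                if self._instance is None:
                    self._instance = self._factory()
        return self._instance


class Expensive:
    """Hypothetical stand-in for an object that is costly to build."""

    created = 0

    def __init__(self):
        Expensive.created += 1


holder = LazySingleton(Expensive)
threads = [threading.Thread(target=holder.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After all eight threads finish, ``Expensive.created`` is 1 and every ``holder.get()`` call returns the same object.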


src/licence_normaliser/_core.py
===============================

src/licence_normaliser/_core.py

   """License Normaliser - public orchestration shim."""

   from __future__ import annotations

   from ._cache import normalise_license, normalise_licenses
   from ._models import LicenseFamily, LicenseName, LicenseVersion

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = (
       "LicenseFamily",
       "LicenseName",
       "LicenseVersion",
       "normalise_license",
       "normalise_licenses",
   )


src/licence_normaliser/_models.py
=================================

src/licence_normaliser/_models.py

   """License data models - frozen dataclasses for the three-level hierarchy."""

   from __future__ import annotations

   from dataclasses import dataclass, field
   from typing import Optional

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = (
       "LicenseFamily",
       "LicenseName",
       "LicenseVersion",
   )


   @dataclass(frozen=True, slots=True)
   class LicenseFamily:
       key: str

       def __str__(self) -> str:
           return self.key

       def __repr__(self) -> str:
           return f"LicenseFamily({self.key!r})"

       def __eq__(self, other: object) -> bool:
           if isinstance(other, LicenseFamily):
               return self.key == other.key
           if isinstance(other, str):
               return self.key == other
           return NotImplemented

       def __hash__(self) -> int:
           return hash(self.key)


   @dataclass(frozen=True, slots=True)
   class LicenseName:
       key: str
       family: LicenseFamily

       def __str__(self) -> str:
           return self.key

       def __repr__(self) -> str:
           return f"LicenseName({self.key!r}, family={self.family.key!r})"

       def __eq__(self, other: object) -> bool:
           if isinstance(other, LicenseName):
               return self.key == other.key
           if isinstance(other, str):
               return self.key == other
           return NotImplemented

       def __hash__(self) -> int:
           return hash(self.key)


   @dataclass(frozen=True, slots=True)
   class LicenseVersion:
       key: str
       url: Optional[str]
       license: LicenseName
       _trace: Optional[object] = field(default=None, repr=False)

       @property
       def family(self) -> LicenseFamily:
           return self.license.family

       def __str__(self) -> str:
           return self.key

       def __repr__(self) -> str:
           return (
               f"LicenseVersion(key={self.key!r}, "
               f"license={self.license.key!r}, "
               f"family={self.license.family.key!r})"
           )

       def __eq__(self, other: object) -> bool:
           if isinstance(other, LicenseVersion):
               return self.key == other.key
           if isinstance(other, str):
               return self.key == other
           return NotImplemented

       def __hash__(self) -> int:
           return hash(self.key)

       def explain(self) -> str:
           """Return explanation of how this license was resolved.

           Set ENABLE_LICENCE_NORMALISER_TRACE=1 to enable tracing,
           or pass trace=True to normalise_license().
           """
           if self._trace is not None:
               return str(self._trace)

           from licence_normaliser._cache import _default
           from licence_normaliser._trace import _should_trace

           if not _should_trace():
               return "Trace disabled. Set ENABLE_LICENCE_NORMALISER_TRACE=1 to enable."

           ln = _default.get()
           cleaned = ln._clean(ln._try_decode_mojibake(self.key))
           result = ln._resolve_with_trace(self.key, cleaned, strict=False)
           trace = result._trace
           return str(trace) if trace else "No trace available."
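
All three models above compare equal to a plain string and hash like that string, so they can stand in for their key in sets and dict lookups. A minimal sketch of that design, using a hypothetical single-field ``Key`` class (no ``slots``, no hierarchy):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    """Hypothetical value object that compares equal to its string key."""

    key: str

    def __eq__(self, other: object) -> bool:
        if isinstance(other, Key):
            return self.key == other.key
        if isinstance(other, str):
            return self.key == other
        return NotImplemented

    def __hash__(self) -> int:
        # Must match hash(str) so Key("mit") and "mit" share a hash bucket.
        return hash(self.key)


seen = {Key("mit"), Key("cc-by-4.0")}
by_key = {Key("mit"): "permissive"}

print(Key("mit") == "mit", "mit" in seen, by_key["mit"])
```

Returning ``NotImplemented`` for other types lets Python fall back to its default comparison, so ``Key("mit") == 3`` is simply ``False`` instead of raising.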


src/licence_normaliser/_normaliser.py
=====================================

src/licence_normaliser/_normaliser.py

   """Plugin-based LicenseNormaliser class with configurable constructor injection."""

   from __future__ import annotations

   import re
   from functools import lru_cache
   from typing import TYPE_CHECKING, Iterable, Sequence

   from licence_normaliser.defaults import (
       get_default_alias,
       get_default_family,
       get_default_name,
       get_default_prose,
       get_default_registry,
       get_default_url,
   )

   if TYPE_CHECKING:
       from licence_normaliser._models import LicenseVersion
       from licence_normaliser._trace import LicenseTrace

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = ("LicenseNormaliser",)

   _WHITESPACE_RE = re.compile(r"\s+")
   _MAX_INPUT = 4096


   class LicenseNormaliser:
       """Configurable license normalisation with plugin-based data sources.

       Plugins are passed as CLASSES (not instances). They're instantiated lazily
       when their load_* method is called.

       Six plugin types are supported (each returns a specific data structure):
           - registry: key -> canonical_key
           - url: cleaned_url -> version_key
           - alias: alias_string -> version_key
           - family: version_key -> family_key
           - name: version_key -> name_key
           - prose: list of (compiled_pattern, version_key)

       Resolution order: aliases -> registry -> url -> prose -> unknown
       Name/family inference: plugin overrides first, then the built-in
       ``_infer_name`` / ``_infer_family`` heuristics as a fallback.

       Tracing
           Set ``trace=True`` to include a resolution trace in the result. The trace
           shows which pipeline stage matched and the source file/line number (when
           available). Tracing is disabled by default for performance.

           Trace can be enabled at three levels (precedence: method >
           constructor > env var):

           - **Constructor**: ``LicenseNormaliser(trace=True)`` - all calls get trace
           - **Method**: ``ln.normalise_license("MIT", trace=True)`` - this call only
           - **Environment**: ``ENABLE_LICENCE_NORMALISER_TRACE=1`` - applies globally

       Example::

           from licence_normaliser import LicenseNormaliser

           # Uses all defaults automatically
           ln = LicenseNormaliser()

           # Disable caching for debugging
           ln = LicenseNormaliser(cache=False)

           # Enable trace for all calls on this instance
           ln = LicenseNormaliser(trace=True)
           v = ln.normalise_license("MIT")
           print(v.explain())  # Shows resolution path with source lines

           # Or enable trace for a single call
           v = ln.normalise_license("MIT", trace=True)
       """

       def __init__(
           self,
           *,
           registry: Sequence[type] | None = None,
           url: Sequence[type] | None = None,
           alias: Sequence[type] | None = None,
           family: Sequence[type] | None = None,
           name: Sequence[type] | None = None,
           prose: Sequence[type] | None = None,
           cache: bool = True,
           cache_maxsize: int = 8192,
           trace: bool | None = None,
       ) -> None:
           self._registry: dict[str, str] = {}
           self._url_map: dict[str, str] = {}
           self._url_to_vkey: dict[str, str] = {}
           self._aliases: dict[str, str] = {}
           self._alias_lines: dict[str, tuple[str, int]] = {}
           self._publisher_alias_lines: dict[str, tuple[str, int]] = {}
           self._publisher_url_lines: dict[str, tuple[str, int]] = {}
           self._prose_lines: list[tuple[re.Pattern[str], str, int]] = []
           self._alias_lines_loaded: bool = False
           self._family_overrides: dict[str, str] = {}
           self._name_overrides: dict[str, str] = {}
           self._prose_patterns: list[tuple[re.Pattern[str], str]] = []
           self._cache = cache
           self._cache_maxsize = cache_maxsize
           self._trace_default = trace

           # Load plugins - use defaults if not explicitly provided
           registry = registry or get_default_registry()
           url = url or get_default_url()
           alias = alias or get_default_alias()
           family = family or get_default_family()
           name = name or get_default_name()
           prose = prose or get_default_prose()

           # Store plugin lists for trace resolution
           self._alias_plugins = alias
           self._url_plugins = url
           self._prose_plugins = prose

           # Instantiate plugins and load their data
           for plugin_cls in registry:
               data = plugin_cls().load_registry()
               self._registry.update(data)

           for plugin_cls in url:
               data = plugin_cls().load_urls()
               self._url_map.update(data)

           # Build inverted URL map: version_key -> cleaned_url (for LicenseVersion.url).
           # Note: despite its name, ``_url_to_vkey`` is keyed by version_key.
           self._url_to_vkey = {v: k for k, v in self._url_map.items()}

           for plugin_cls in alias:
               data = plugin_cls().load_aliases()
               self._aliases.update(data)

           for plugin_cls in family:
               data = plugin_cls().load_families()
               self._family_overrides.update(data)

           for plugin_cls in name:
               data = plugin_cls().load_names()
               self._name_overrides.update(data)

           for plugin_cls in prose:
               patterns = plugin_cls().load_prose()
               self._prose_patterns.extend(patterns)

           # Set up cached resolution
           if self._cache:
               resolve_fn = lru_cache(maxsize=self._cache_maxsize)(self._resolve_impl)
               self._resolve_impl = resolve_fn  # type: ignore[assignment]

       def _get_trace_mode(self, trace: bool | None) -> bool:
           """Determine if tracing is enabled: method arg > constructor > env var."""
           from licence_normaliser._trace import _should_trace

           if trace is not None:
               return trace
           if self._trace_default is not None:
               return self._trace_default
           return _should_trace()

       def _load_alias_lines(self) -> None:
           """Lazily load source line numbers (aliases, URLs, prose) for tracing."""
           for plugin_cls in self._alias_plugins:
               if hasattr(plugin_cls, "load_aliases_with_lines"):
                   lines_data = plugin_cls().load_aliases_with_lines()
                   for alias_key, (version_key, line_num) in lines_data.items():
                       if version_key == self._aliases.get(alias_key):
                           self._alias_lines[alias_key] = (version_key, line_num)

           for plugin_cls in self._url_plugins:
               if hasattr(plugin_cls, "load_aliases_with_lines"):
                   lines_data = plugin_cls().load_aliases_with_lines()
                   for alias_key, (version_key, line_num) in lines_data.items():
                       if version_key == self._aliases.get(alias_key):
                           self._publisher_alias_lines[alias_key] = (version_key, line_num)

           for plugin_cls in self._url_plugins:
               if hasattr(plugin_cls, "load_urls_with_lines"):
                   lines_data = plugin_cls().load_urls_with_lines()
                   for url_key, (version_key, line_num) in lines_data.items():
                       if version_key == self._url_map.get(url_key):
                           self._publisher_url_lines[url_key] = (version_key, line_num)

           for plugin_cls in self._prose_plugins:
               if hasattr(plugin_cls, "load_prose_with_lines"):
                   lines_data = plugin_cls().load_prose_with_lines()
                   self._prose_lines.extend(lines_data)

       def _resolve_with_trace(
           self, raw: str, cleaned: str, strict: bool
       ) -> LicenseVersion:
           """Resolve with full pipeline tracing."""
           from licence_normaliser._trace import LicenseTrace, LicenseTraceStage

           # Lazy load alias lines on first trace call
           if not self._alias_lines_loaded:
               self._load_alias_lines()
               self._alias_lines_loaded = True

           stages: list[LicenseTraceStage] = []

           # 1. Alias lookup
           if cleaned in self._aliases:
               output = self._aliases[cleaned]
               source_line = None
               source_file = None
               if cleaned in self._alias_lines:
                   _, source_line = self._alias_lines[cleaned]
                   source_file = "aliases.json"
               stages.append(
                   LicenseTraceStage(
                       "alias", cleaned, output, True, source_line, source_file
                   )
               )
               v = self._make(output)
               trace = LicenseTrace(
                   raw,
                   cleaned,
                   stages,
                   version_key=v.key,
                   name_key=v.license.key,
                   family_key=v.family.key,
               )
               return self._make_with_trace(v, trace)

           stages.append(LicenseTraceStage("alias", cleaned, "", False))

           # 2. Registry lookup
           if cleaned in self._registry:
               canonical = self._registry[cleaned]
               stages.append(LicenseTraceStage("registry", cleaned, canonical, True))
               v = self._make(canonical)
               trace = LicenseTrace(
                   raw,
                   cleaned,
                   stages,
                   version_key=v.key,
                   name_key=v.license.key,
                   family_key=v.family.key,
               )
               return self._make_with_trace(v, trace)

           stages.append(LicenseTraceStage("registry", cleaned, "", False))

           # 3. URL lookup
           url_key = self._normalise_url(cleaned)
           if url_key in self._url_map:
               resolved = self._url_map[url_key]
               source_line = None
               source_file = None
               if url_key in self._publisher_url_lines:
                   _, source_line = self._publisher_url_lines[url_key]
                   source_file = "publishers.json"
               stages.append(
                   LicenseTraceStage(
                       "url", url_key, resolved, True, source_line, source_file
                   )
               )
               v = self._make(resolved)
               trace = LicenseTrace(
                   raw,
                   cleaned,
                   stages,
                   version_key=v.key,
                   name_key=v.license.key,
                   family_key=v.family.key,
               )
               return self._make_with_trace(v, trace)

           stages.append(LicenseTraceStage("url", cleaned, "", False))

           # 4. Prose matching (only for longer strings)
           if len(cleaned) >= 20:
               for i, (pattern, vkey) in enumerate(self._prose_patterns):
                   if pattern.search(cleaned):
                       source_line = None
                       source_file = "prose_patterns.json"
                       if self._prose_lines and i < len(self._prose_lines):
                           _, _, source_line = self._prose_lines[i]
                       stages.append(
                           LicenseTraceStage(
                               "prose", cleaned, vkey, True, source_line, source_file
                           )
                       )
                       v = self._make(vkey)
                       trace = LicenseTrace(
                           raw,
                           cleaned,
                           stages,
                           version_key=v.key,
                           name_key=v.license.key,
                           family_key=v.family.key,
                       )
                       return self._make_with_trace(v, trace)

           stages.append(LicenseTraceStage("prose", cleaned, "", False))

           # 5. Fallback to unknown
           stages.append(LicenseTraceStage("fallback", cleaned, cleaned, True))
           v = self._make_unknown(cleaned)
           trace = LicenseTrace(
               raw,
               cleaned,
               stages,
               version_key=v.key,
               name_key=v.license.key,
               family_key=v.family.key,
           )
           return self._make_with_trace(v, trace)

       def _make_with_trace(
           self, v: LicenseVersion, trace: LicenseTrace
       ) -> LicenseVersion:
           """Attach ``trace`` to ``v`` in place and return the same object."""

           # LicenseVersion is a frozen dataclass, so bypass its __setattr__ guard.
           object.__setattr__(v, "_trace", trace)
           return v

       def _resolve_impl(self, cleaned: str) -> LicenseVersion:
           # 1. Alias lookup
           if cleaned in self._aliases:
               return self._make(self._aliases[cleaned])

           # 2. Registry lookup
           if cleaned in self._registry:
               canonical = self._registry[cleaned]
               return self._make(canonical)

           # 3. URL lookup
           url_key = self._normalise_url(cleaned)
           if url_key in self._url_map:
               return self._make(self._url_map[url_key])

           # 4. Prose matching (only for longer strings)
           if len(cleaned) >= 20:
               for pattern, vkey in self._prose_patterns:
                   if pattern.search(cleaned):
                       return self._make(vkey)

           # 5. Fallback to unknown
           return self._make_unknown(cleaned)

       def normalise_license(
           self, raw: str, *, strict: bool = False, trace: bool | None = None
       ) -> LicenseVersion:
           """Normalise a single license string.

           Args:
               raw: The raw license string, SPDX ID, URL, or prose description.
               strict: If True, raises ``LicenseNotFoundError`` when the input
                   cannot be resolved to a known license.
               trace: If True, include resolution trace showing which pipeline
                   stage matched and source file/line. If None, uses the instance
                   default (``trace`` param from constructor) or falls back to
                   ``ENABLE_LICENCE_NORMALISER_TRACE`` env var.

           Returns:
               A ``LicenseVersion`` with the resolved key, license name, and family.

           Raises:
               LicenseNotFoundError: When ``strict=True`` and resolution fails.
           """
           from licence_normaliser.exceptions import LicenseNotFoundError

           do_trace = self._get_trace_mode(trace)

           if not raw or not raw.strip():
               cleaned = "unknown"
               v = self._make_unknown(cleaned)
               if do_trace:
                   from licence_normaliser._trace import LicenseTrace, LicenseTraceStage

                   stages = [LicenseTraceStage("fallback", cleaned, cleaned, True)]
                   trace_obj = LicenseTrace(
                       raw,
                       cleaned,
                       stages,
                       version_key=v.key,
                       name_key=v.license.key,
                       family_key=v.family.key,
                   )
                   v = self._make_with_trace(v, trace_obj)
           else:
               cleaned = self._clean(self._try_decode_mojibake(raw))
               if do_trace:
                   v = self._resolve_with_trace(raw, cleaned, strict)
               else:
                   v = self._resolve_impl(cleaned)

           if strict and v.family.key == "unknown":
               raise LicenseNotFoundError(raw, v.key) from None
           return v

       def normalise_licenses(
           self, raws: Iterable[str], *, strict: bool = False, trace: bool | None = None
       ) -> list[LicenseVersion]:
           """Batch normalisation.

           When ``strict=True``, raises on the first failure.
           """
           from licence_normaliser.exceptions import LicenseNotFoundError

           results: list[LicenseVersion] = []
           for raw in raws:
               v = self.normalise_license(raw, strict=False, trace=trace)
               if strict and v.family.key == "unknown":
                   raise LicenseNotFoundError(raw, v.key) from None
               results.append(v)
           return results

       def registry_keys(self) -> set[str]:
           """Return the set of all known registry keys."""
           return set(self._registry.keys())

       def _make(self, key: str) -> LicenseVersion:
           """Factory: build a LicenseVersion from a resolved version_key."""
           from licence_normaliser._models import (
               LicenseFamily,
               LicenseName,
               LicenseVersion,
           )

           k = key.lower().strip()

           # Get canonical key from registry
           canonical = self._registry.get(k) or k

           # Get URL via inverted map: version_key -> cleaned_url
           url = self._url_to_vkey.get(canonical) or self._url_to_vkey.get(k)

           # Infer name:
           # - For CC licenses, use the override if present, else _infer_name()
           # - For non-CC (GPL, AGPL, OSI, etc.), use the override only if it differs
           #   from canonical; otherwise keep canonical (no stripping)
           override_name = self._name_overrides.get(canonical)
           if canonical.startswith("cc-") or canonical.startswith("cc0"):
               # CC licenses: use override if present, otherwise fallback to _infer_name
               name_key = override_name if override_name else self._infer_name(canonical)
           else:
               # Non-CC: use override if present and different, otherwise canonical
               name_key = (
                   override_name
                   if override_name and override_name != canonical
                   else canonical
               )

           # Infer family: use override only if it provides a different value
           override_family = self._family_overrides.get(canonical)
           family_key = (
               override_family
               if override_family and override_family != canonical
               else self._infer_family(canonical)
           )

           family = LicenseFamily(key=family_key)
           name = LicenseName(key=name_key, family=family)
           return LicenseVersion(key=canonical, url=url, license=name)

       def _make_unknown(self, key: str) -> LicenseVersion:
           """Factory: build an unknown LicenseVersion for unresolved input."""
           from licence_normaliser._models import (
               LicenseFamily,
               LicenseName,
               LicenseVersion,
           )

           family = LicenseFamily(key="unknown")
           name = LicenseName(key=key, family=family)
           return LicenseVersion(key=key, url=None, license=name)

       def _infer_family(self, key: str) -> str:
           """Fallback family inference - only used if no plugin provides it."""
           k = key.lower()
           if k.startswith("cc0"):
               return "cc0"
           if k.startswith("cc-pdm"):
               return "public-domain"
           if k.startswith("cc-"):
               return "cc"
           if k.startswith(("gpl-", "agpl-", "lgpl-")):
               return "copyleft"
           if k.startswith(("odbl", "odc-by")):
               return "open-data"
           if k.startswith(("pddl-", "odc-")):
               return "data"
           if k.startswith(
               (
                   "elsevier-oa",
                   "acs-authorchoice",
                   "acs-authorchoice-ccby",
                   "acs-authorchoice-ccbyncnd",
                   "acs-authorchoice-nih",
                   "jama-cc-by",
                   "thieme-nlm",
                   "implied-oa",
                   "unspecified-oa",
                   "publisher-specific-oa",
                   "author-manuscript",
                   "oup-chorus",
               )
           ):
               return "publisher-oa"
           if k.startswith(
               (
                   "elsevier-tdm",
                   "wiley-tdm",
                   "springer-tdm",
                   "springernature-tdm",
                   "iop-tdm",
                   "aps-tdm",
               )
           ):
               return "publisher-tdm"
           if k.startswith(
               (
                   "elsevier-",
                   "wiley-",
                   "springer-",
                   "springernature-",
                   "acs-",
                   "rsc-",
                   "iop-",
                   "bmj-",
                   "aaas-",
                   "pnas-",
                   "aps-",
                   "cup-",
                   "aip-",
                   "jama-",
                   "degruyter-",
                   "oup-",
                   "sage-",
                   "tandf-",
                   "thieme-",
               )
           ):
               return "publisher-proprietary"
           if k in ("public-domain", "other-oa", "open-access"):
               return "public-domain" if k == "public-domain" else "other-oa"
           return "unknown"

       def _infer_name(self, key: str) -> str:
           """Fallback name inference - only used if no plugin provides it."""
           k = key.lower()
           if k.startswith("cc0"):
               return "cc0"
           if k.startswith("cc-"):
               parts = k.split("-")
               for i, part in enumerate(parts):
                   if part.replace(".", "").isdigit():
                       return "-".join(parts[:i])
               return "-".join(parts[:2])
           # For all other licenses (GPL, AGPL, OSI, etc.), keep the key as-is
           return k

       @staticmethod
       def _clean(raw: str) -> str:
           s = _WHITESPACE_RE.sub(" ", raw.strip().rstrip("/")).lower()
           return s[:_MAX_INPUT]

       @staticmethod
       def _try_decode_mojibake(s: str) -> str:
           try:
               return s.encode("latin-1").decode("utf-8")
           except (UnicodeEncodeError, UnicodeDecodeError):
               return s

       @staticmethod
       def _normalise_url(cleaned: str) -> str:
           key = cleaned.lower()
           if key.startswith("http://"):
               key = "https://" + key[7:]
           return key.rstrip("/")
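
The helper methods above are small enough to exercise in isolation. The sketch below re-implements ``_clean``, ``_normalise_url`` and ``_infer_name`` as free functions. The whitespace regex and the 500-character cap are assumptions standing in for ``_WHITESPACE_RE`` and ``_MAX_INPUT``, whose real definitions are not shown in this excerpt.

```python
import re

# Assumed stand-ins: the real _WHITESPACE_RE and _MAX_INPUT are defined
# elsewhere in the module and may differ.
WHITESPACE_RE = re.compile(r"\s+")
MAX_INPUT = 500


def clean(raw: str) -> str:
    """Trim, drop trailing slashes, collapse whitespace, lowercase, truncate."""
    s = WHITESPACE_RE.sub(" ", raw.strip().rstrip("/")).lower()
    return s[:MAX_INPUT]


def normalise_url(cleaned: str) -> str:
    """Upgrade http:// to https:// and strip trailing slashes."""
    key = cleaned.lower()
    if key.startswith("http://"):
        key = "https://" + key[7:]
    return key.rstrip("/")


def infer_name(key: str) -> str:
    """Drop the version suffix from Creative Commons keys; keep others."""
    k = key.lower()
    if k.startswith("cc0"):
        return "cc0"
    if k.startswith("cc-"):
        parts = k.split("-")
        for i, part in enumerate(parts):
            # The first purely numeric part (dots ignored) marks the version.
            if part.replace(".", "").isdigit():
                return "-".join(parts[:i])
        return "-".join(parts[:2])
    return k


cleaned_text = clean("  CC   BY\t4.0 ")
cleaned_url = normalise_url(clean("http://creativecommons.org/licenses/by/4.0/"))
cc_name = infer_name("CC-BY-NC-4.0")
gpl_name = infer_name("gpl-3.0")
```

Note that non-CC keys such as ``gpl-3.0`` pass through ``infer_name`` with their version intact, which is why the curated data files supply an explicit ``name_key`` for them.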


src/licence_normaliser/_trace.py
================================

src/licence_normaliser/_trace.py

   """License trace and explanation support."""

   from __future__ import annotations

   import os
   from dataclasses import dataclass, field

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = (
       "TRACE_STAGES",
       "LicenseTrace",
       "LicenseTraceStage",
   )

   TRACE_STAGES = ("alias", "registry", "url", "prose", "fallback")


   @dataclass
   class LicenseTraceStage:
       """Single stage in the license resolution pipeline."""

       stage: str
       input: str
       output: str
       matched: bool
       source_line: int | None = None
       source_file: str | None = None


   @dataclass
   class LicenseTrace:
       """Complete trace of license resolution pipeline."""

       raw_input: str
       cleaned_input: str
       stages: list[LicenseTraceStage] = field(default_factory=list)
       version_key: str = ""
       name_key: str = ""
       family_key: str = ""

       def __str__(self) -> str:
           lines = [f"Input: {self.raw_input!r} → {self.cleaned_input!r}"]
           for s in self.stages:
               status = "✓" if s.matched else "-"
               source_info = ""
               if s.source_line is not None:
                   source_info = f" (line {s.source_line}"
                   if s.source_file:
                       source_info += f" in {s.source_file}"
                   source_info += ")"
               lines.append(
                   f"  [{status}] {s.stage}: {s.input!r} → {s.output!r}{source_info}"
               )
           lines.append("")
           lines.append("Result:")
           lines.append(f"  version_key: {self.version_key!r}")
           lines.append(f"  name_key: {self.name_key!r}")
           lines.append(f"  family_key: {self.family_key!r}")
           return "\n".join(lines)


   def _should_trace() -> bool:
       """Check if tracing is enabled via environment variable."""
       return os.environ.get("ENABLE_LICENCE_NORMALISER_TRACE", "").lower() in (
           "1",
           "true",
           "yes",
       )
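
The environment check above accepts only three values, compared case-insensitively. The following standalone copy (renamed ``should_trace`` for illustration) shows which settings enable tracing and which do not.

```python
import os

ENV_VAR = "ENABLE_LICENCE_NORMALISER_TRACE"


def should_trace() -> bool:
    """Standalone copy of the environment check shown above."""
    return os.environ.get(ENV_VAR, "").lower() in ("1", "true", "yes")


os.environ[ENV_VAR] = "TRUE"
enabled_upper = should_trace()  # accepted: the comparison lowercases first

os.environ[ENV_VAR] = "on"
enabled_on = should_trace()  # rejected: "on" is not in the accepted tuple

os.environ.pop(ENV_VAR, None)
enabled_unset = should_trace()  # rejected: unset falls back to ""
```

Values such as ``on`` or ``enabled`` silently leave tracing off, so scripts that set the variable should stick to ``1``, ``true`` or ``yes``.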


src/licence_normaliser/cli/__init__.py
======================================

src/licence_normaliser/cli/__init__.py

   """licence_normaliser.cli - command-line interface for licence-normaliser."""

   from ._main import main

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = ("main",)


src/licence_normaliser/cli/_main.py
===================================

src/licence_normaliser/cli/_main.py

   """licence-normaliser CLI - license normalisation from the command line."""

   import argparse
   import sys
   from pathlib import Path

   from licence_normaliser import __version__, normalise_license
   from licence_normaliser._trace import _should_trace
   from licence_normaliser.defaults import get_all_refreshable_plugins
   from licence_normaliser.exceptions import (
       LicenseNormalisationError,
       LicenseNotFoundError,
   )

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = ("main",)


   def _build_parser() -> argparse.ArgumentParser:
       parser = argparse.ArgumentParser(
           prog="licence-normaliser",
           description="Comprehensive license normalisation - three-level hierarchy.",
       )
       parser.add_argument(
           "--version",
           action="version",
           version=f"%(prog)s {__version__}",
       )

       sub = parser.add_subparsers(dest="command", required=True)

       norm = sub.add_parser("normalise", help="Normalise a license string.")
       norm.add_argument("license", help="License string to normalise.")
       norm.add_argument("--full", action="store_true")
       norm.add_argument("--strict", action="store_true")
       norm.add_argument("--trace", action="store_true", help="Show resolution trace.")

       batch = sub.add_parser("batch", help="Normalise multiple license strings.")
       batch.add_argument("licenses", nargs="+")
       batch.add_argument("--strict", action="store_true")
       batch.add_argument(
           "--trace", action="store_true", help="Show resolution trace for each."
       )

       update = sub.add_parser(
           "update-data", help="Fetch fresh data from all registered parsers."
       )
       update.add_argument(
           "--parser",
           dest="parser_name",
           metavar="NAME",
           help="Refresh only the named parser (e.g. spdx, opendefinition, osi). "
           "Without this flag, all parsers are refreshed.",
       )
       update.add_argument(
           "--force",
           action="store_true",
           help="Overwrite even if the local file already exists.",
       )

       return parser


   def _cmd_normalise(args: argparse.Namespace) -> int:
       try:
           trace = args.trace or _should_trace()
           result = normalise_license(args.license, strict=args.strict, trace=trace)
           if trace:
               print(result.explain())
           elif args.full:
               print(f"Key: {result.key}")
               print(f"URL: {result.url or '(none)'}")
               print(f"License: {result.license}")
               print(f"Family: {result.family}")
           else:
               print(result.key)
       except (LicenseNotFoundError, LicenseNormalisationError) as exc:
           print(f"error: {exc}", file=sys.stderr)
           return 1
       return 0


   def _cmd_batch(args: argparse.Namespace) -> int:
       trace = args.trace or _should_trace()
       if args.strict:
           try:
               for license_str in args.licenses:
                   result = normalise_license(license_str, strict=True, trace=trace)
                   if trace:
                       print(f"{license_str}:")
                       print(result.explain())
                   else:
                       print(f"{license_str}: {result.key}")
           except LicenseNotFoundError as exc:
               print(f"error: {exc}", file=sys.stderr)
               return 1
       else:
           for license_str in args.licenses:
               result = normalise_license(license_str, strict=False, trace=trace)
               if trace:
                   print(f"{license_str}:")
                   print(result.explain())
               else:
                   print(f"{license_str}: {result.key}")
       return 0


   def _cmd_update_data(args: argparse.Namespace) -> int:
       parser_classes = get_all_refreshable_plugins()
       if args.parser_name:
           parser_classes = [
               p for p in parser_classes if getattr(p, "id", None) == args.parser_name
           ]
           if not parser_classes:
               available = [
                   getattr(p, "id", p.__name__) for p in get_all_refreshable_plugins()
               ]
               print(
                   f"error: unknown parser {args.parser_name!r}. Available: {available}",
                   file=sys.stderr,
               )
               return 1

       failed: list[str] = []
       for parser_cls in parser_classes:
           name = getattr(parser_cls, "id", parser_cls.__name__)
           url = parser_cls.url
           target = parser_cls.local_path
           target_path = Path(__file__).parent.parent / target
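           # Record existence before refreshing: a successful fetch creates the
           # file, so checking afterwards would report "skipped" for fresh data.
           existed = target_path.exists()
           ok = parser_cls.refresh(args.force)
           if not ok:
               status = "FAILED"
               failed.append(name)
           elif existed and not args.force:
               status = "skipped"
           else:
               status = "fetched"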
           print(f"  {status}: {name} ({url}) -> {target}")

       if failed:
           print(f"error: failed to refresh: {', '.join(failed)}", file=sys.stderr)
           return 1
       print("Data sources updated successfully.")
       return 0


   def main() -> None:
       parser = _build_parser()
       args = parser.parse_args()

       if args.command == "normalise":
           sys.exit(_cmd_normalise(args))
       elif args.command == "batch":
           sys.exit(_cmd_batch(args))
       elif args.command == "update-data":
           sys.exit(_cmd_update_data(args))
       else:
           parser.print_help()
           sys.exit(1)
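
To see how the subcommands above parse their arguments without installing the package, the sketch below builds a reduced copy of the parser. It keeps only the ``normalise`` and ``batch`` subcommands and omits help strings; ``build_parser`` is an illustrative name, not part of the package.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reduced copy of _build_parser: normalise and batch subcommands only."""
    parser = argparse.ArgumentParser(prog="licence-normaliser")
    sub = parser.add_subparsers(dest="command", required=True)

    norm = sub.add_parser("normalise")
    norm.add_argument("license")
    norm.add_argument("--full", action="store_true")
    norm.add_argument("--strict", action="store_true")
    norm.add_argument("--trace", action="store_true")

    batch = sub.add_parser("batch")
    batch.add_argument("licenses", nargs="+")
    batch.add_argument("--strict", action="store_true")
    batch.add_argument("--trace", action="store_true")
    return parser


parser = build_parser()
# A license string containing spaces must arrive as a single argv element.
single = parser.parse_args(["normalise", "CC BY 4.0", "--full"])
many = parser.parse_args(["batch", "MIT", "Apache 2.0", "--strict"])
```

On a shell command line this means quoting multi-word license strings, for example ``licence-normaliser normalise "CC BY 4.0" --full``.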


src/licence_normaliser/data/README.rst
======================================

src/licence_normaliser/data/README.rst

   Data Directory
   ==============

   This directory contains all normalisation data files loaded at runtime
   by ``licence-normaliser``. You can extend or override entries without
   touching any Python code.

   Structure
   ---------

   ::

       data/
       ├── aliases/
       │   └── aliases.json             # Alias string → metadata dict
       ├── urls/
       │   └── url_map.json             # Canonical URL → metadata dict
       ├── prose/
       │   └── prose_patterns.json      # Ordered regex patterns for long text scanning
       ├── publishers/
       │   └── publishers.json          # Publisher URLs and shorthand aliases
       ├── spdx/
       │   └── spdx.json                # SPDX license list (auto-refreshed)
       ├── opendefinition/
       │   └── opendefinition.json      # Open Definition list (auto-refreshed)
       ├── osi/
       │   └── osi.json                 # OSI license list (auto-refreshed)
       ├── creativecommons/
       │   └── creativecommons.json     # CC licenses (scraped from creativecommons.org)
       └── scancode_licensedb/
           └── scancode_licensedb.json  # ScanCode license DB (auto-refreshed)

   Entry Format
   ------------

   Every entry maps a **lookup key** (alias string, URL, or prose pattern)
   to a metadata dict with three required fields:

   - ``version_key`` – the canonical version-level identifier
     (e.g. ``"cc-by-4.0"``)
   - ``name_key`` – the name-level identifier without version suffix
     (e.g. ``"cc-by"``)
   - ``family_key`` – the family-level identifier (e.g. ``"cc"``)

   A URL, when present, is stored in the ``url`` field of the same
   metadata dict.

   How to Add a New License Alias
   ------------------------------

   Edit ``aliases/aliases.json``:

   .. code:: json

      {
        "my new alias": {
          "version_key": "cc-by-4.0",
          "name_key": "cc-by",
          "family_key": "cc"
        }
      }

   The key must be **lowercase and whitespace-collapsed**.

   How to Add a Publisher URL or Shorthand
   ---------------------------------------

   Edit ``publishers/publishers.json``:

   .. code:: json

      {
        "urls": {
          "https://example.com/my-license/": {
            "version_key": "my-license",
            "name_key": "my-license",
            "family_key": "publisher-oa"
          }
        },
        "shorthand_aliases": {
          "my shorthand alias": "my-license"
        }
      }

   Both ``http://`` and ``https://`` URL variants may be listed; they are
   normalised at lookup time (http→https, trailing slash stripped).

   How to Add a New URL Mapping
   ----------------------------

   Edit ``urls/url_map.json``:

   .. code:: json

      {
        "https://example.com/my-license/": {
          "version_key": "my-license",
          "name_key": "my-license",
          "family_key": "publisher-oa"
        }
      }

   How to Add a New Prose Pattern
   ------------------------------

   Edit ``prose/prose_patterns.json`` — insert your entry **before** any
   pattern it should take priority over:

   .. code:: json

      [
        {"pattern": "my very specific phrase",
         "version_key": "my-license",
         "name_key": "my-license",
         "family_key": "publisher-oa"},
        ...
      ]

   Patterns are Python regular expressions matched case-insensitively.
   More-specific patterns must come first.

   How to Add a Brand-New License
   ------------------------------

   1. Add entries to one or more JSON data files (``aliases/aliases.json``,
      ``urls/url_map.json``, ``prose/prose_patterns.json``, or
      ``publishers/publishers.json``). Each entry maps a key to a dict with
      ``version_key``, ``name_key``, and ``family_key``.

   2. If the ``family_key`` is not covered by the regex fallback table in
      ``_registry.py``, add an explicit ``family_key`` value in the JSON
      entry (recommended).

   3. Run ``make test-env ENV=py312`` to verify.

   Updating Upstream Data Sources
   ------------------------------

   The ``licence-normaliser update-data`` CLI command fetches fresh upstream data:

   .. code:: sh

       licence-normaliser update-data --force

   This updates:

   - ``spdx/spdx.json`` — full `SPDX license list <https://spdx.org/licenses/>`_
   - ``opendefinition/opendefinition.json`` — full `Open Definition list <https://opendefinition.org/>`_
   - ``osi/osi.json`` — `OSI license list <https://opensource.org/licenses>`_
   - ``creativecommons/creativecommons.json`` — scraped from creativecommons.org
   - ``scancode_licensedb/scancode_licensedb.json`` — `ScanCode license DB <https://scancode-licensedb.aboutcode.org/>`_

   Family Override Files
   ---------------------

   Some entries carry an explicit ``family_key`` that overrides the
   inference logic in ``_registry.py``.  These are stored in:

   - ``aliases/aliases.json`` — ``family_key`` on any alias entry populates
     ``FAMILY_OVERRIDES`` at import time.


src/licence_normaliser/data/aliases/aliases.json
================================================

src/licence_normaliser/data/aliases/aliases.json

   {
     "_comment": "Curated alias map: cleaned-lowercase-string -> metadata dict.",
     "_comment2": "Keys must already be in cleaned form (lowercase, whitespace-collapsed).",
     "aaas reuse": {
       "version_key": "aaas-author-reuse",
       "name_key": "aaas-author-reuse",
       "family_key": "publisher-proprietary",
       "aliases": [
         "aaas author reuse",
         "aaas-author-reuse",
         "science author reuse"
       ]
     },
     "acs authorchoice": {
       "version_key": "acs-authorchoice",
       "name_key": "acs-authorchoice",
       "family_key": "publisher-oa",
       "aliases": [
         "acs-authorchoice"
       ]
     },
     "acs-authorchoice-ccby": {
       "version_key": "acs-authorchoice-ccby",
       "name_key": "acs-authorchoice-ccby",
       "family_key": "publisher-oa",
       "aliases": [
         "acs authorchoice cc by"
       ]
     },
     "acs-authorchoice-ccbyncnd": {
       "version_key": "acs-authorchoice-ccbyncnd",
       "name_key": "acs-authorchoice-ccbyncnd",
       "family_key": "publisher-oa"
     },
     "acs-authorchoice-nih": {
       "version_key": "acs-authorchoice-nih",
       "name_key": "acs-authorchoice-nih",
       "family_key": "publisher-oa"
     },
     "agpl-3": {
       "version_key": "agpl-3.0",
       "name_key": "agpl-3",
       "family_key": "copyleft",
       "aliases": [
         "agpl-v3",
         "agpl 3",
         "agpl",
         "agpl v3",
         "agpl-3.0+"
       ]
     },
     "aip-rights": {
       "version_key": "aip-rights",
       "name_key": "aip-rights",
       "family_key": "publisher-proprietary",
       "aliases": [
         "aip permissions"
       ]
     },
     "all rights reserved": {
       "version_key": "all-rights-reserved",
       "name_key": "all-rights-reserved",
       "family_key": "publisher-proprietary",
       "aliases": [
         "all-rights-reserved"
       ]
     },
     "apache 2.0": {
       "version_key": "apache-2.0",
       "name_key": "apache",
       "family_key": "osi",
       "aliases": [
         "apache 2",
         "apache",
         "apache license",
         "apache license 2.0"
       ]
     },
     "aps-default": {
       "version_key": "aps-default",
       "name_key": "aps-default",
       "family_key": "publisher-proprietary",
       "aliases": [
         "aps default license"
       ]
     },
     "aps-tdm": {
       "version_key": "aps-tdm",
       "name_key": "aps-tdm",
       "family_key": "publisher-tdm",
       "aliases": [
         "aps text mining"
       ]
     },
     "author manuscript": {
       "version_key": "author-manuscript",
       "name_key": "author-manuscript",
       "family_key": "publisher-oa",
       "aliases": [
         "author-manuscript"
       ]
     },
     "bmj-copyright": {
       "version_key": "bmj-copyright",
       "name_key": "bmj-copyright",
       "family_key": "publisher-proprietary"
     },
     "bsd 2-clause": {
       "version_key": "bsd-2-clause",
       "name_key": "bsd-2-clause",
       "family_key": "osi",
       "aliases": [
         "bsd 2 clause",
         "bsd-2-clause",
         "bsd-2"
       ]
     },
     "bsd 3-clause": {
       "version_key": "bsd-3-clause",
       "name_key": "bsd-3-clause",
       "family_key": "osi",
       "aliases": [
         "bsd 3 clause",
         "bsd-3-clause",
         "bsd-3",
         "bsd-3 license",
         "bsd",
         "bsd license"
       ],
       "justification": "BSD 3-Clause is sometimes called 'BSD', so we need to make sure that this doesn't get confused with the generic 'bsd' alias for the BSD-2-Clause license."
     },
     "cc by": {
       "version_key": "cc-by",
       "name_key": "cc-by",
       "family_key": "cc",
       "aliases": [
         "cc-by",
         "cc by",
         "creative commons attribution",
         "creative commons attribution license",
         "creative commons by"
       ]
     },
     "cc by 1.0": {
       "version_key": "cc-by-1.0",
       "name_key": "cc-by",
       "family_key": "cc"
     },
     "cc by 2.0": {
       "version_key": "cc-by-2.0",
       "name_key": "cc-by",
       "family_key": "cc"
     },
     "cc by 2.5": {
       "version_key": "cc-by-2.5",
       "name_key": "cc-by",
       "family_key": "cc"
     },
     "cc by 3.0": {
       "version_key": "cc-by-3.0",
       "name_key": "cc-by",
       "family_key": "cc",
       "aliases": [
         "cc-by-3.0",
         "cc-by-3",
         "creative commons attribution 3.0"
       ]
     },
     "cc by 4.0": {
       "version_key": "cc-by-4.0",
       "name_key": "cc-by",
       "family_key": "cc",
       "aliases": [
         "cc-by-4.0",
         "cc by 4",
         "cc-by 4",
         "cc-by-4",
         "creative commons attribution 4.0",
         "creative commons attribution 4.0 international",
         "creative commons attribution 4.0 international license",
         "creative commons by 4.0"
       ]
     },
     "cc by-nc": {
       "version_key": "cc-by-nc",
       "name_key": "cc-by-nc",
       "family_key": "cc",
       "aliases": [
         "cc-by-nc",
         "cc by nc",
         "cc-by nc",
         "creative commons attribution-noncommercial",
         "creative commons by-nc"
       ]
     },
     "cc by-nc 3.0": {
       "version_key": "cc-by-nc-3.0",
       "name_key": "cc-by-nc",
       "family_key": "cc"
     },
     "cc by-nc 4.0": {
       "version_key": "cc-by-nc-4.0",
       "name_key": "cc-by-nc",
       "family_key": "cc",
       "aliases": [
         "cc-by-nc-4.0",
         "cc by nc 4",
         "cc-by nc 4",
         "cc by nc-4",
         "cc-by nc-4",
         "cc-by-nc 4",
         "creative commons attribution-noncommercial 4.0",
         "creative commons attribution-noncommercial 4.0 international",
         "creative commons attribution-noncommercial 4.0 international license",
         "creative commons by-nc 4.0"
       ]
     },
     "cc by-nc-nd": {
       "version_key": "cc-by-nc-nd",
       "name_key": "cc-by-nc-nd",
       "family_key": "cc",
       "aliases": [
         "cc-by-nc-nd",
         "cc by nc-nd",
         "cc by nc nd",
         "cc-by nc-nd",
         "creative commons attribution-noncommercial-noderivatives",
         "creative commons by-nc-nd"
       ]
     },
     "cc by-nc-nd 3.0": {
       "version_key": "cc-by-nc-nd-3.0",
       "name_key": "cc-by-nc-nd",
       "family_key": "cc"
     },
     "cc by-nc-nd 3.0 igo": {
       "version_key": "cc-by-nc-nd-3.0-igo",
       "name_key": "cc-by-nc-nd",
       "family_key": "cc",
       "justification": "IGO is a jurisdiction tag not a rights modifier. Rights profile (Attribution + NonCommercial + NoDerivatives) is identical to base instrument. Enforcement differs (international arbitration vs domestic courts) but does not affect license type."
     },
     "cc by-nc-nd 4.0": {
       "version_key": "cc-by-nc-nd-4.0",
       "name_key": "cc-by-nc-nd",
       "family_key": "cc",
       "aliases": [
         "cc-by-nc-nd-4.0",
         "cc by nc-nd 4",
         "cc-by nc-nd 4",
         "cc by nc-nd-4",
         "cc-by nc-nd-4",
         "cc-by-nc-nd 4",
         "creative commons attribution-noncommercial-noderivatives 4.0",
         "creative commons attribution-noncommercial-noderivatives 4.0 international",
         "creative commons attribution-noncommercial-noderivatives 4.0 international license",
         "creative commons by-nc-nd 4.0"
       ]
     },
     "cc by-nc-sa": {
       "version_key": "cc-by-nc-sa",
       "name_key": "cc-by-nc-sa",
       "family_key": "cc",
       "aliases": [
         "cc-by-nc-sa",
         "cc by nc-sa",
         "cc by nc sa",
         "cc-by nc-sa",
         "creative commons by-nc-sa"
       ]
     },
     "cc by-nc-sa 3.0": {
       "version_key": "cc-by-nc-sa-3.0",
       "name_key": "cc-by-nc-sa",
       "family_key": "cc"
     },
     "cc by-nc-sa 4.0": {
       "version_key": "cc-by-nc-sa-4.0",
       "name_key": "cc-by-nc-sa",
       "family_key": "cc",
       "aliases": [
         "cc-by-nc-sa-4.0",
         "cc by nc-sa 4",
         "cc-by nc-sa 4",
         "cc-by-nc-sa 4",
         "cc by nc-sa-4",
         "cc-by nc-sa-4",
         "creative commons attribution-noncommercial-sharealike 4.0",
         "creative commons attribution-noncommercial-sharealike 4.0 international",
         "creative commons attribution-noncommercial-sharealike 4.0 international license",
         "creative commons by-nc-sa 4.0"
       ]
     },
     "cc by-nd": {
       "version_key": "cc-by-nd",
       "name_key": "cc-by-nd",
       "family_key": "cc",
       "aliases": [
         "cc-by-nd",
         "cc by nd",
         "cc-by nd",
         "creative commons by-nd",
         "creative commons attribution-noderivatives"
       ]
     },
     "cc by-nd 3.0": {
       "version_key": "cc-by-nd-3.0",
       "name_key": "cc-by-nd",
       "family_key": "cc"
     },
     "cc by-nd 4.0": {
       "version_key": "cc-by-nd-4.0",
       "name_key": "cc-by-nd",
       "family_key": "cc",
       "aliases": [
         "cc-by-nd-4.0",
         "cc by nd 4",
         "cc-by nd 4",
         "cc by nd-4",
         "cc-by nd-4",
         "cc-by-nd 4",
         "creative commons attribution-noderivatives 4.0",
         "creative commons attribution-noderivatives 4.0 international",
         "creative commons attribution-noderivatives 4.0 international license",
         "creative commons by-nd 4.0"
       ]
     },
     "cc by-sa": {
       "version_key": "cc-by-sa",
       "name_key": "cc-by-sa",
       "family_key": "cc",
       "aliases": [
         "cc-by-sa",
         "cc by sa",
         "cc-by sa",
         "creative commons attribution-sharealike",
         "creative commons by-sa"
       ]
     },
     "cc by-sa 3.0": {
       "version_key": "cc-by-sa-3.0",
       "name_key": "cc-by-sa",
       "family_key": "cc"
     },
     "cc by-sa 4.0": {
       "version_key": "cc-by-sa-4.0",
       "name_key": "cc-by-sa",
       "family_key": "cc",
       "aliases": [
         "cc-by-sa-4.0",
         "cc by sa 4",
         "cc-by sa 4",
         "cc by sa-4",
         "cc-by sa-4",
         "cc-by-sa 4",
         "creative commons attribution-sharealike 4.0",
         "creative commons attribution-sharealike 4.0 international",
         "creative commons attribution-sharealike 4.0 international license",
         "creative commons by-sa 4.0"
       ]
     },
     "cc-pdm 1.0": {
       "version_key": "cc-pdm-1.0",
       "name_key": "cc-pdm",
       "family_key": "public-domain",
       "aliases": [
         "cc-pdm-1.0",
         "cc pdm 1.0",
         "cc pdm-1.0",
         "cc-pdm",
         "cc pdm",
         "creative commons public domain",
         "creative commons public domain mark 1.0",
         "creative commons public domain mark"
       ]
     },
     "cc0 1.0": {
       "version_key": "cc0-1.0",
       "name_key": "cc0",
       "family_key": "cc0",
       "aliases": [
         "cc0-1.0",
         "cc-zero 1.0",
         "cc zero 1.0",
         "creative commons zero 1.0",
         "cc0",
         "cc 0",
         "cc zero",
         "creative commons zero",
         "cc-zero"
       ]
     },
     "cup-terms": {
       "version_key": "cup-terms",
       "name_key": "cup-terms",
       "family_key": "publisher-proprietary",
       "aliases": [
         "cambridge terms"
       ]
     },
     "degruyter-terms": {
       "version_key": "degruyter-terms",
       "name_key": "degruyter-terms",
       "family_key": "publisher-proprietary",
       "aliases": [
         "de gruyter terms"
       ]
     },
     "elsevier oa": {
       "version_key": "elsevier-oa",
       "name_key": "elsevier-oa",
       "family_key": "publisher-oa",
       "aliases": [
         "elsevier-oa",
         "elsevier user license"
       ]
     },
     "elsevier tdm": {
       "version_key": "elsevier-tdm",
       "name_key": "elsevier-tdm",
       "family_key": "publisher-tdm",
       "aliases": [
         "elsevier tdmu",
         "elsevier-tdm"
       ]
     },
     "gpl-2": {
       "version_key": "gpl-2.0",
       "name_key": "gpl-2",
       "family_key": "copyleft",
       "aliases": [
         "gpl-v2",
         "gpl 2",
         "gnu gpl v2",
         "gpl v2",
         "gpl-2.0+"
       ]
     },
     "gpl-3": {
       "version_key": "gpl-3.0",
       "name_key": "gpl-3",
       "family_key": "copyleft",
       "aliases": [
         "gpl-v3",
         "gpl v3 only",
         "gpl 3",
         "gnu gpl",
         "gnu gpl v3",
         "gpl",
         "gpl v3",
         "gpl-3.0+"
       ],
       "justification": "gnu gpl, gnu gpl v3, gpl, gpl v3, gpl-3, and gpl-3.0+ are all standard aliases for GPL-3.0."
     },
     "implied oa": {
       "version_key": "implied-oa",
       "name_key": "implied-oa",
       "family_key": "publisher-oa",
       "aliases": [
         "implied open access",
         "implied-oa"
       ]
     },
     "iop-copyright": {
       "version_key": "iop-copyright",
       "name_key": "iop-copyright",
       "family_key": "publisher-proprietary"
     },
     "iop-tdm": {
       "version_key": "iop-tdm",
       "name_key": "iop-tdm",
       "family_key": "publisher-tdm",
       "aliases": [
         "iop text and data mining"
       ]
     },
     "isc license": {
       "version_key": "isc",
       "name_key": "isc",
       "family_key": "osi"
     },
     "jama-cc-by": {
       "version_key": "jama-cc-by",
       "name_key": "jama-cc-by",
       "family_key": "publisher-oa",
       "aliases": [
         "jama open access"
       ]
     },
     "lgpl": {
       "version_key": "lgpl-3.0",
       "name_key": "lgpl-3",
       "family_key": "copyleft"
     },
     "lgpl v2.1": {
       "version_key": "lgpl-2.1",
       "name_key": "lgpl-2.1",
       "family_key": "copyleft"
     },
     "lgpl v3": {
       "version_key": "lgpl-3.0",
       "name_key": "lgpl-3",
       "family_key": "copyleft"
     },
     "lgpl-2": {
       "version_key": "lgpl-2.1",
       "name_key": "lgpl-2.1",
       "family_key": "copyleft",
       "aliases": [
         "lgpl-v2",
         "lgpl 2",
         "lgpl-2.1-only",
         "lgpl-2.1-or-later"
       ]
     },
     "lgpl-2.1+": {
       "version_key": "lgpl-2.1",
       "name_key": "lgpl-2.1",
       "family_key": "copyleft"
     },
     "lgpl-3": {
       "version_key": "lgpl-3.0",
       "name_key": "lgpl-3",
       "family_key": "copyleft",
       "aliases": [
         "lgpl-v3",
         "lgpl 3"
       ]
     },
     "lgpl-3.0+": {
       "version_key": "lgpl-3.0",
       "name_key": "lgpl-3",
       "family_key": "copyleft"
     },
     "mit license": {
       "version_key": "mit",
       "name_key": "mit",
       "family_key": "osi",
       "aliases": [
         "the mit license"
       ]
     },
     "mozilla public license 2.0": {
       "version_key": "mpl-2.0",
       "name_key": "mpl",
       "family_key": "osi",
       "aliases": [
         "mpl",
         "mpl-2.0",
         "mpl 2.0",
         "mozilla license",
         "mozilla public license",
         "mozilla"
       ]
     },
     "no reuse": {
       "version_key": "no-reuse",
       "name_key": "no-reuse",
       "family_key": "publisher-proprietary",
       "aliases": [
         "no-reuse"
       ]
     },
     "odbl": {
       "version_key": "odbl",
       "name_key": "odbl",
       "family_key": "open-data",
       "aliases": [
         "open database license"
       ]
     },
     "odc-by": {
       "version_key": "odc-by",
       "name_key": "odc-by",
       "family_key": "open-data"
     },
     "other-oa": {
       "version_key": "other-oa",
       "name_key": "other-oa",
       "family_key": "other-oa",
       "aliases": [
         "open access",
         "open-access"
       ]
     },
     "oup-chorus": {
       "version_key": "oup-chorus",
       "name_key": "oup-chorus",
       "family_key": "publisher-oa"
     },
     "oup-terms": {
       "version_key": "oup-terms",
       "name_key": "oup-terms",
       "family_key": "publisher-proprietary",
       "aliases": [
         "oup standard publication"
       ]
     },
     "pd": {
       "version_key": "public-domain",
       "name_key": "public-domain",
       "family_key": "public-domain",
       "aliases": [
         "public domain",
         "public-domain"
       ]
     },
     "pddl": {
       "version_key": "pddl",
       "name_key": "pddl",
       "family_key": "open-data"
     },
     "pnas terms": {
       "version_key": "pnas-licenses",
       "name_key": "pnas-licenses",
       "family_key": "publisher-proprietary",
       "aliases": [
         "pnas-licenses"
       ]
     },
     "rsc-terms": {
       "version_key": "rsc-terms",
       "name_key": "rsc-terms",
       "family_key": "publisher-proprietary"
     },
     "sage-permissions": {
       "version_key": "sage-permissions",
       "name_key": "sage-permissions",
       "family_key": "publisher-proprietary"
     },
     "springer tdm": {
       "version_key": "springer-tdm",
       "name_key": "springer-tdm",
       "family_key": "publisher-tdm",
       "aliases": [
         "springer-tdm"
       ]
     },
     "springernature-tdm": {
       "version_key": "springernature-tdm",
       "name_key": "springernature-tdm",
       "family_key": "publisher-tdm",
       "aliases": [
         "springer nature tdm",
         "springer nature text and data mining"
       ]
     },
     "tandf-terms": {
       "version_key": "tandf-terms",
       "name_key": "tandf-terms",
       "family_key": "publisher-proprietary",
       "aliases": [
         "taylor and francis terms",
         "taylor francis terms"
       ]
     },
     "thieme nlm": {
       "version_key": "thieme-nlm",
       "name_key": "thieme-nlm",
       "family_key": "publisher-oa",
       "aliases": [
         "thieme-nlm"
       ]
     },
     "unlicense": {
       "version_key": "unlicense",
       "name_key": "unlicense",
       "family_key": "osi"
     },
     "unspecified oa": {
       "version_key": "unspecified-oa",
       "name_key": "unspecified-oa",
       "family_key": "other-oa",
       "aliases": [
         "unspecified-oa"
       ]
     },
     "wiley terms": {
       "version_key": "wiley-terms",
       "name_key": "wiley-terms",
       "family_key": "publisher-proprietary",
       "aliases": [
         "wiley-terms"
       ]
     },
     "wiley-am": {
       "version_key": "wiley-am",
       "name_key": "wiley-am",
       "family_key": "publisher-proprietary",
       "aliases": [
         "wiley author manuscript"
       ]
     },
     "wiley-tdm": {
       "version_key": "wiley-tdm",
       "name_key": "wiley-tdm",
       "family_key": "publisher-tdm",
       "aliases": [
         "wiley tdm license"
       ]
     },
     "wiley-vor": {
       "version_key": "wiley-vor",
       "name_key": "wiley-vor",
       "family_key": "publisher-proprietary"
     },
     "wtfpl": {
       "version_key": "wtfpl",
       "name_key": "wtfpl",
       "family_key": "osi"
     },
     "zlib": {
       "version_key": "zlib",
       "name_key": "zlib",
       "family_key": "osi"
     },
     "© the author(s)": {
       "version_key": "publisher-specific-oa",
       "name_key": "publisher-specific-oa",
       "family_key": "publisher-oa",
       "aliases": [
         "publisher specific oa",
         "publisher-specific-oa"
       ]
     }
   }
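
The entries above map a primary key to its metadata plus an optional ``aliases`` list. As a minimal sketch of how such a file could be flattened into a single lookup table, the snippet below indexes both the primary key and every alias by ``version_key``. The function name ``build_alias_index`` and the lowercase/strip cleaning step are assumptions for illustration, not the behaviour of ``parsers/alias.py``.

```python
import json

# Two entries copied from the file above; the real file holds many more.
SAMPLE = json.loads("""
{
  "pd": {
    "version_key": "public-domain",
    "name_key": "public-domain",
    "family_key": "public-domain",
    "aliases": ["public domain", "public-domain"]
  },
  "pddl": {
    "version_key": "pddl",
    "name_key": "pddl",
    "family_key": "open-data"
  }
}
""")


def build_alias_index(entries):
    """Hypothetical helper: map every key and alias to its version_key."""
    index = {}
    for key, meta in entries.items():
        if not isinstance(meta, dict):
            continue  # skip any "_comment"-style string values
        # "aliases" is optional, so fall back to an empty list.
        for alias in [key, *meta.get("aliases", [])]:
            index[alias.strip().lower()] = meta["version_key"]
    return index


ALIAS_INDEX = build_alias_index(SAMPLE)
```

With this sketch, ``ALIAS_INDEX["public domain"]`` and ``ALIAS_INDEX["pd"]`` both resolve to ``public-domain``, while ``pddl`` (which has no ``aliases`` list) is indexed under its own key only.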


src/licence_normaliser/data/prose/prose_patterns.json
=====================================================

src/licence_normaliser/data/prose/prose_patterns.json

   [
     {"pattern": "cc\\s*by-nc-nd\\s*4\\.0", "version_key": "cc-by-nc-nd-4.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
     {"pattern": "cc\\s*by-nc-nd\\s*3\\.0", "version_key": "cc-by-nc-nd-3.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
     {"pattern": "cc\\s*by-nc-nd", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"},
     {"pattern": "cc\\s*by-nc-sa\\s*4\\.0", "version_key": "cc-by-nc-sa-4.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
     {"pattern": "cc\\s*by-nc-sa\\s*3\\.0", "version_key": "cc-by-nc-sa-3.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
     {"pattern": "creative\\s+commons\\s+by-nc-sa", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"},
     {"pattern": "creative\\s+commons\\s+by-nc-nd", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"},
     {"pattern": "creative\\s+commons\\s+by-nc", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"},
     {"pattern": "creative\\s+commons\\s+by-nd", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"},
     {"pattern": "creative\\s+commons\\s+by-sa", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"},
     {"pattern": "creative\\s+commons\\s+by", "version_key": "cc-by", "name_key": "cc-by", "family_key": "cc"},
     {"pattern": "cc\\s*by-nc-sa", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"},
     {"pattern": "cc\\s*by-nc\\s*4\\.0", "version_key": "cc-by-nc-4.0", "name_key": "cc-by-nc", "family_key": "cc"},
     {"pattern": "cc\\s*by-nc\\s*3\\.0", "version_key": "cc-by-nc-3.0", "name_key": "cc-by-nc", "family_key": "cc"},
     {"pattern": "cc\\s*by-nc", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"},
     {"pattern": "cc\\s*by-sa\\s*4\\.0", "version_key": "cc-by-sa-4.0", "name_key": "cc-by-sa", "family_key": "cc"},
     {"pattern": "cc\\s*by-sa\\s*3\\.0", "version_key": "cc-by-sa-3.0", "name_key": "cc-by-sa", "family_key": "cc"},
     {"pattern": "cc\\s*by-sa", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"},
     {"pattern": "cc\\s*by-nd\\s*4\\.0", "version_key": "cc-by-nd-4.0", "name_key": "cc-by-nd", "family_key": "cc"},
     {"pattern": "cc\\s*by-nd\\s*3\\.0", "version_key": "cc-by-nd-3.0", "name_key": "cc-by-nd", "family_key": "cc"},
     {"pattern": "cc\\s*by-nd", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"},
     {"pattern": "cc\\s*by\\s*4\\.0", "version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"},
     {"pattern": "cc\\s*by\\s*3\\.0", "version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"},
     {"pattern": "cc\\s*by\\s*2\\.0", "version_key": "cc-by-2.0", "name_key": "cc-by", "family_key": "cc"},
     {"pattern": "\\bcc\\s*by\\b(?!\\s*-)", "version_key": "cc-by", "name_key": "cc-by", "family_key": "cc"},
     {"pattern": "\\bcc\\s*0\\b|cc\\s*zero", "version_key": "cc0", "name_key": "cc0", "family_key": "cc0"},
     {"pattern": "attribution.{0,30}non.?commercial.{0,30}no.?deriv", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"},
     {"pattern": "attribution.{0,30}non.?commercial.{0,30}share.?alike", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"},
     {"pattern": "attribution.{0,30}non.?commercial", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"},
     {"pattern": "attribution.{0,30}no.?deriv", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"},
     {"pattern": "attribution.{0,30}share.?alike", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"},

     {"pattern": "elsevier.*tdm|tdm.*elsevier", "version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"},
     {"pattern": "elsevier.*user\\s*licen", "version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"},
     {"pattern": "wiley.*tdm|tdm.*wiley", "version_key": "wiley-tdm", "name_key": "wiley-tdm", "family_key": "publisher-tdm"},
     {"pattern": "springer.*tdm|tdm.*springer", "version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"},
     {"pattern": "acs\\s*authorchoice.*cc\\s*by(?!-nc)", "version_key": "acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa"},
     {"pattern": "acs\\s*authorchoice", "version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"},
     {"pattern": "all\\s*rights\\s*reserved", "version_key": "all-rights-reserved", "name_key": "all-rights-reserved", "family_key": "publisher-proprietary"},
     {"pattern": "author\\s*manuscript", "version_key": "author-manuscript", "name_key": "author-manuscript", "family_key": "publisher-oa"},
     {"pattern": "public\\s*domain", "version_key": "public-domain", "name_key": "public-domain", "family_key": "public-domain"},
     {"pattern": "open\\s*access", "version_key": "other-oa", "name_key": "other-oa", "family_key": "other-oa"}
   ]
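
The list above places more specific patterns (versioned, or with more licence elements) ahead of more general ones. The sketch below shows why that ordering matters under a first-match-wins scan. It assumes the patterns are tried in file order with a case-insensitive ``re.search``; that matching strategy and the name ``match_prose`` are assumptions for illustration, not confirmed behaviour of ``parsers/prose.py``.

```python
import re

# A small subset of the patterns above, kept in the same relative order.
PATTERNS = [
    {"pattern": r"cc\s*by-nc-nd\s*4\.0", "version_key": "cc-by-nc-nd-4.0"},
    {"pattern": r"cc\s*by-nc-nd", "version_key": "cc-by-nc-nd"},
    {"pattern": r"cc\s*by-nc", "version_key": "cc-by-nc"},
    {"pattern": r"\bcc\s*by\b(?!\s*-)", "version_key": "cc-by"},
    {"pattern": r"open\s*access", "version_key": "other-oa"},
]

# Assumption: matching is case-insensitive.
COMPILED = [
    (re.compile(p["pattern"], re.IGNORECASE), p["version_key"]) for p in PATTERNS
]


def match_prose(text):
    """Hypothetical helper: return the version_key of the first matching pattern."""
    for regex, version_key in COMPILED:
        if regex.search(text):
            return version_key
    return None
```

Under these assumptions, a string such as ``"Published under CC BY-NC-ND 4.0"`` is caught by the versioned pattern before the unversioned ``cc-by-nc-nd`` or ``cc-by-nc`` patterns are reached, and an article described as open access under CC BY resolves to ``cc-by`` rather than the generic ``other-oa`` fallback.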


src/licence_normaliser/data/publishers/publishers.json
======================================================

src/licence_normaliser/data/publishers/publishers.json

   {
     "_comment": "Publisher-specific license URLs and shorthand aliases.",
     "_comment2": "URLs: normalized to https with no trailing slash on lookup.",
     "_comment3": "Aliases: cleaned-lowercase form -> version_key.",

     "urls": {
       "https://www.elsevier.com/open-access/userlicense/1.0/": {
         "version_key": "elsevier-oa",
         "name_key": "elsevier-oa",
         "family_key": "publisher-oa"
       },
       "http://www.elsevier.com/open-access/userlicense/1.0/": {
         "version_key": "elsevier-oa",
         "name_key": "elsevier-oa",
         "family_key": "publisher-oa"
       },
       "https://www.elsevier.com/tdm/userlicense/1.0/": {
         "version_key": "elsevier-tdm",
         "name_key": "elsevier-tdm",
         "family_key": "publisher-tdm"
       },
       "http://www.elsevier.com/tdm/userlicense/1.0/": {
         "version_key": "elsevier-tdm",
         "name_key": "elsevier-tdm",
         "family_key": "publisher-tdm"
       },
       "http://doi.wiley.com/10.1002/tdm_license_1": {
         "version_key": "wiley-tdm",
         "name_key": "wiley-tdm",
         "family_key": "publisher-tdm"
       },
       "http://doi.wiley.com/10.1002/tdm_license_1.1": {
         "version_key": "wiley-tdm-1.1",
         "name_key": "wiley-tdm",
         "family_key": "publisher-tdm"
       },
       "http://onlinelibrary.wiley.com/termsAndConditions#vor": {
         "version_key": "wiley-vor",
         "name_key": "wiley-vor",
         "family_key": "publisher-proprietary"
       },
       "http://onlinelibrary.wiley.com/termsAndConditions#am": {
         "version_key": "wiley-am",
         "name_key": "wiley-am",
         "family_key": "publisher-proprietary"
       },
       "https://onlinelibrary.wiley.com/termsandconditions#vor": {
         "version_key": "wiley-vor",
         "name_key": "wiley-vor",
         "family_key": "publisher-proprietary"
       },
       "https://onlinelibrary.wiley.com/termsandconditions#am": {
         "version_key": "wiley-am",
         "name_key": "wiley-am",
         "family_key": "publisher-proprietary"
       },
       "https://onlinelibrary.wiley.com/termsandconditions": {
         "version_key": "wiley-terms",
         "name_key": "wiley-terms",
         "family_key": "publisher-proprietary"
       },
       "https://onlinelibrary.wiley.com/terms-and-conditions": {
         "version_key": "wiley-terms",
         "name_key": "wiley-terms",
         "family_key": "publisher-proprietary"
       },
       "https://www.springer.com/tdm": {
         "version_key": "springer-tdm",
         "name_key": "springer-tdm",
         "family_key": "publisher-tdm"
       },
       "http://www.springer.com/tdm": {
         "version_key": "springer-tdm",
         "name_key": "springer-tdm",
         "family_key": "publisher-tdm"
       },
       "https://www.springernature.com/gp/researchers/text-and-data-mining": {
         "version_key": "springernature-tdm",
         "name_key": "springernature-tdm",
         "family_key": "publisher-tdm"
       },
       "https://www.tandfonline.com/action/showCopyRight": {
         "version_key": "tandf-terms",
         "name_key": "tandf-terms",
         "family_key": "publisher-proprietary"
       },
       "https://www.tandfonline.com/action/showcopyright": {
         "version_key": "tandf-terms",
         "name_key": "tandf-terms",
         "family_key": "publisher-proprietary"
       },
       "https://tandfonline.com/action/showcopyright": {
         "version_key": "tandf-terms",
         "name_key": "tandf-terms",
         "family_key": "publisher-proprietary"
       },
       "https://www.tandfonline.com/action/showcopyright?show=full": {
         "version_key": "tandf-terms",
         "name_key": "tandf-terms",
         "family_key": "publisher-proprietary"
       },
       "https://us.sagepub.com/en-us/nam/journals-permissions": {
         "version_key": "sage-permissions",
         "name_key": "sage-permissions",
         "family_key": "publisher-proprietary"
       },
       "https://www.sagepub.com/journalspermissions.nav": {
         "version_key": "sage-permissions",
         "name_key": "sage-permissions",
         "family_key": "publisher-proprietary"
       },
       "https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": {
         "version_key": "acs-authorchoice-ccby",
         "name_key": "acs-authorchoice-ccby",
         "family_key": "publisher-oa"
       },
       "http://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": {
         "version_key": "acs-authorchoice-ccby",
         "name_key": "acs-authorchoice-ccby",
         "family_key": "publisher-oa"
       },
       "https://pubs.acs.org/page/policy/authorchoice_ccbyncnd_termsofuse.html": {
         "version_key": "acs-authorchoice-ccbyncnd",
         "name_key": "acs-authorchoice-ccbyncnd",
         "family_key": "publisher-oa"
       },
       "https://pubs.acs.org/page/policy/authorchoice_termsofuse.html": {
         "version_key": "acs-authorchoice",
         "name_key": "acs-authorchoice",
         "family_key": "publisher-oa"
       },
       "https://pubs.acs.org/page/policy/acs_authorchoice_with_nih_addendum_termsofuse.html": {
         "version_key": "acs-authorchoice-nih",
         "name_key": "acs-authorchoice-nih",
         "family_key": "publisher-oa"
       },
       "https://doi.org/10.1021/policy/oa-license": {
         "version_key": "acs-authorchoice",
         "name_key": "acs-authorchoice",
         "family_key": "publisher-oa"
       },
       "https://www.rsc.org/journals-books-databases/journal-authors-reviewers/licences-copyright-permissions/": {
         "version_key": "rsc-terms",
         "name_key": "rsc-terms",
         "family_key": "publisher-proprietary"
       },
       "https://www.rsc.org/help/disclaimer/pages/term3.aspx": {
         "version_key": "rsc-terms",
         "name_key": "rsc-terms",
         "family_key": "publisher-proprietary"
       },
       "https://iopscience.iop.org/info/page/text-and-data-mining": {
         "version_key": "iop-tdm",
         "name_key": "iop-tdm",
         "family_key": "publisher-tdm"
       },
       "http://iopscience.iop.org/info/page/text-and-data-mining": {
         "version_key": "iop-tdm",
         "name_key": "iop-tdm",
         "family_key": "publisher-tdm"
       },
       "https://iopscience.iop.org/page/copyright": {
         "version_key": "iop-copyright",
         "name_key": "iop-copyright",
         "family_key": "publisher-proprietary"
       },
       "https://www.bmj.com/company/legal-stuff/copyright-notice/": {
         "version_key": "bmj-copyright",
         "name_key": "bmj-copyright",
         "family_key": "publisher-proprietary"
       },
       "https://group.bmj.com/group/rights-licensing/permissions": {
         "version_key": "bmj-copyright",
         "name_key": "bmj-copyright",
         "family_key": "publisher-proprietary"
       },
       "https://www.science.org/content/page/science-licenses-journal-article-reuse": {
         "version_key": "aaas-author-reuse",
         "name_key": "aaas-author-reuse",
         "family_key": "publisher-proprietary"
       },
       "https://www.sciencemag.org/about/science-licenses-journal-article-reuse": {
         "version_key": "aaas-author-reuse",
         "name_key": "aaas-author-reuse",
         "family_key": "publisher-proprietary"
       },
       "https://www.pnas.org/site/aboutpnas/licenses.xhtml": {
         "version_key": "pnas-licenses",
         "name_key": "pnas-licenses",
         "family_key": "publisher-proprietary"
       },
       "https://link.aps.org/licenses/aps-default-license": {
         "version_key": "aps-default",
         "name_key": "aps-default",
         "family_key": "publisher-proprietary"
       },
       "https://link.aps.org/licenses/aps-default-text-mining-license": {
         "version_key": "aps-tdm",
         "name_key": "aps-tdm",
         "family_key": "publisher-tdm"
       },
       "https://www.cambridge.org/core/terms": {
         "version_key": "cup-terms",
         "name_key": "cup-terms",
         "family_key": "publisher-proprietary"
       },
       "https://publishing.aip.org/authors/rights-and-permissions": {
         "version_key": "aip-rights",
         "name_key": "aip-rights",
         "family_key": "publisher-proprietary"
       },
       "http://publishing.aip.org/authors/rights-and-permissions": {
         "version_key": "aip-rights",
         "name_key": "aip-rights",
         "family_key": "publisher-proprietary"
       },
       "https://jamanetwork.com/pages/cc-by-license-permissions": {
         "version_key": "jama-cc-by",
         "name_key": "jama-cc-by",
         "family_key": "publisher-oa"
       },
       "https://www.degruyter.com/dg/page/496": {
         "version_key": "degruyter-terms",
         "name_key": "degruyter-terms",
         "family_key": "publisher-proprietary"
       },
       "https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model": {
         "version_key": "oup-chorus",
         "name_key": "oup-chorus",
         "family_key": "publisher-oa"
       },
       "https://academic.oup.com/pages/standard-publication-reuse-rights": {
         "version_key": "oup-terms",
         "name_key": "oup-terms",
         "family_key": "publisher-proprietary"
       },
       "https://www.gnu.org/licenses/gpl-2.0.html": {
         "version_key": "gpl-2.0",
         "name_key": "gpl-2",
         "family_key": "copyleft"
       },
       "https://www.gnu.org/licenses/gpl-3.0.html": {
         "version_key": "gpl-3.0",
         "name_key": "gpl-3",
         "family_key": "copyleft"
       },
       "https://www.gnu.org/licenses/agpl-3.0.html": {
         "version_key": "agpl-3.0",
         "name_key": "agpl-3",
         "family_key": "copyleft"
       },
       "https://www.gnu.org/licenses/lgpl-2.1.html": {
         "version_key": "lgpl-2.1",
         "name_key": "lgpl-2.1",
         "family_key": "copyleft"
       },
       "https://www.gnu.org/licenses/lgpl-3.0.html": {
         "version_key": "lgpl-3.0",
         "name_key": "lgpl-3",
         "family_key": "copyleft"
       },
       "https://opendatacommons.org/licenses/odbl/1-0/": {
         "version_key": "odbl",
         "name_key": "odbl",
         "family_key": "open-data"
       },
       "https://opendatacommons.org/licenses/by/1-0/": {
         "version_key": "odc-by",
         "name_key": "odc-by",
         "family_key": "open-data"
       },
       "https://opendatacommons.org/licenses/pddl/1-0/": {
         "version_key": "pddl",
         "name_key": "pddl",
         "family_key": "open-data"
       }
     },

     "shorthand_aliases": {
       "elsevier user license": "elsevier-oa",
       "elsevier tdm": "elsevier-tdm",
       "elsevier tdmu": "elsevier-tdm",
       "wiley tdm license": "wiley-tdm",
       "wiley tdm": "wiley-tdm",
       "wiley vor": "wiley-vor",
       "wiley am": "wiley-am",
       "wiley author manuscript": "wiley-am",
       "springer tdm": "springer-tdm",
       "springer nature tdm": "springernature-tdm",
       "springer nature text and data mining": "springernature-tdm",
       "tandf terms": "tandf-terms",
       "taylor and francis terms": "tandf-terms",
       "taylor francis terms": "tandf-terms",
       "sage permissions": "sage-permissions",
       "acs authorchoice": "acs-authorchoice",
       "acs author choice": "acs-authorchoice",
       "acs authorchoice cc by": "acs-authorchoice-ccby",
       "acs authorchoice cc by nc nd": "acs-authorchoice-ccbyncnd",
       "acs authorchoice nih": "acs-authorchoice-nih",
       "rsc terms": "rsc-terms",
       "rsc copyright": "rsc-terms",
       "iop tdm": "iop-tdm",
       "iop text and data mining": "iop-tdm",
       "iop copyright": "iop-copyright",
       "bmj copyright": "bmj-copyright",
       "bmj permissions": "bmj-copyright",
       "aaas author reuse": "aaas-author-reuse",
       "aaas reuse": "aaas-author-reuse",
       "science author reuse": "aaas-author-reuse",
       "pnas licenses": "pnas-licenses",
       "pnas terms": "pnas-licenses",
       "aps default": "aps-default",
       "aps tdm": "aps-tdm",
       "aps text mining": "aps-tdm",
       "aps default license": "aps-default",
       "cambridge terms": "cup-terms",
       "cup terms": "cup-terms",
       "aip rights": "aip-rights",
       "aip permissions": "aip-rights",
       "jama cc by": "jama-cc-by",
       "jama open access": "jama-cc-by",
       "degruyter terms": "degruyter-terms",
       "de gruyter terms": "degruyter-terms",
       "oup chorus": "oup-chorus",
       "oup terms": "oup-terms",
       "oup standard publication": "oup-terms",
       "thieme nlm": "thieme-nlm",
       "implied oa": "implied-oa",
       "implied open access": "implied-oa",
       "unspecified oa": "unspecified-oa",
       "publisher specific oa": "publisher-specific-oa",
       "author manuscript": "author-manuscript",
       "all rights reserved": "all-rights-reserved",
       "no reuse": "no-reuse",
       "public domain": "public-domain",
       "open access": "other-oa",
       "creative commons public domain": "cc-pdm-1.0",
       "pd": "public-domain"
     }
   }
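
The ``_comment`` fields above describe two lookups: URLs normalised to https with no trailing slash, and shorthand aliases keyed by a cleaned lowercase form. The following is a minimal sketch of both, using a few entries from the file. The helper names ``normalise_url``, ``clean_alias`` and ``lookup`` are hypothetical, and "cleaned" is assumed here to mean lowercased with whitespace collapsed; the actual rules in ``parsers/publisher.py`` may differ.

```python
# A few entries taken from the "urls" and "shorthand_aliases" sections above,
# reduced to url/alias -> version_key for brevity.
URLS = {
    "https://www.elsevier.com/tdm/userlicense/1.0/": "elsevier-tdm",
    "http://www.springer.com/tdm": "springer-tdm",
}
SHORTHAND = {"wiley tdm": "wiley-tdm", "pd": "public-domain"}


def normalise_url(url):
    """Force the https scheme and drop any trailing slash."""
    url = url.strip()
    if url.startswith("http://"):
        url = "https://" + url[len("http://"):]
    return url.rstrip("/")


def clean_alias(text):
    """Assumed cleaning: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


# Normalise the keys as well, so stored and looked-up forms agree.
URL_INDEX = {normalise_url(url): key for url, key in URLS.items()}


def lookup(value):
    """Hypothetical helper: resolve a URL or shorthand alias to a version_key."""
    if "://" in value:
        return URL_INDEX.get(normalise_url(value))
    return SHORTHAND.get(clean_alias(value))
```

In this sketch the http and https variants of a URL collapse to one key, so ``lookup("http://www.elsevier.com/tdm/userlicense/1.0")`` and the https form with a trailing slash both return ``elsevier-tdm``. Path case is left untouched, which is consistent with the file listing both ``showCopyRight`` and ``showcopyright`` variants explicitly.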


src/licence_normaliser/data/urls/url_map.json
=============================================

src/licence_normaliser/data/urls/url_map.json

   {
     "_comment": "URL -> metadata dict. Both http and https variants may be listed.",
     "_comment2": "Normalisation (https, no trailing slash) is applied on load.",

     "https://creativecommons.org/licenses/by/4.0/": {"version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"},
     "https://creativecommons.org/licenses/by/3.0/": {"version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"},
     "https://creativecommons.org/licenses/by/2.5/": {"version_key": "cc-by-2.5", "name_key": "cc-by", "family_key": "cc"},
     "https://creativecommons.org/licenses/by/2.0/": {"version_key": "cc-by-2.0", "name_key": "cc-by", "family_key": "cc"},
     "https://creativecommons.org/licenses/by/1.0/": {"version_key": "cc-by-1.0", "name_key": "cc-by", "family_key": "cc"},
     "https://creativecommons.org/licenses/by/3.0/deed.en_us": {"version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"},

     "https://creativecommons.org/licenses/by-sa/4.0/": {"version_key": "cc-by-sa-4.0", "name_key": "cc-by-sa", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-sa/3.0/": {"version_key": "cc-by-sa-3.0", "name_key": "cc-by-sa", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-sa/2.5/": {"version_key": "cc-by-sa-2.5", "name_key": "cc-by-sa", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-sa/2.0/": {"version_key": "cc-by-sa-2.0", "name_key": "cc-by-sa", "family_key": "cc"},

     "https://creativecommons.org/licenses/by-nd/4.0/": {"version_key": "cc-by-nd-4.0", "name_key": "cc-by-nd", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nd/3.0/": {"version_key": "cc-by-nd-3.0", "name_key": "cc-by-nd", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nd/2.0/": {"version_key": "cc-by-nd-2.0", "name_key": "cc-by-nd", "family_key": "cc"},

     "https://creativecommons.org/licenses/by-nc/4.0/": {"version_key": "cc-by-nc-4.0", "name_key": "cc-by-nc", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc/3.0/": {"version_key": "cc-by-nc-3.0", "name_key": "cc-by-nc", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc/2.5/": {"version_key": "cc-by-nc-2.5", "name_key": "cc-by-nc", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc/2.0/": {"version_key": "cc-by-nc-2.0", "name_key": "cc-by-nc", "family_key": "cc"},

     "https://creativecommons.org/licenses/by-nc-sa/4.0/": {"version_key": "cc-by-nc-sa-4.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc-sa/3.0/": {"version_key": "cc-by-nc-sa-3.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc-sa/2.5/": {"version_key": "cc-by-nc-sa-2.5", "name_key": "cc-by-nc-sa", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc-sa/2.0/": {"version_key": "cc-by-nc-sa-2.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},

     "https://creativecommons.org/licenses/by-nc-nd/4.0/": {"version_key": "cc-by-nc-nd-4.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc-nd/3.0/": {"version_key": "cc-by-nc-nd-3.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc-nd/2.5/": {"version_key": "cc-by-nc-nd-2.5", "name_key": "cc-by-nc-nd", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc-nd/2.0/": {"version_key": "cc-by-nc-nd-2.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},

     "https://creativecommons.org/licenses/by/3.0/igo/": {"version_key": "cc-by-3.0-igo", "name_key": "cc-by-igo", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc-sa/3.0/igo/": {"version_key": "cc-by-nc-sa-3.0-igo", "name_key": "cc-by-nc-sa-igo", "family_key": "cc"},
     "https://creativecommons.org/licenses/by-nc-nd/3.0/igo/": {"version_key": "cc-by-nc-nd-3.0-igo", "name_key": "cc-by-nc-nd-igo", "family_key": "cc"},

     "https://creativecommons.org/publicdomain/zero/1.0/": {"version_key": "cc0", "name_key": "cc0", "family_key": "cc0"},
     "https://creativecommons.org/publicdomain/mark/1.0/": {"version_key": "cc-pdm", "name_key": "cc-pdm", "family_key": "public-domain"},

     "https://www.gnu.org/licenses/gpl-2.0.html": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"},
     "https://www.gnu.org/licenses/gpl-2.0": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"},
     "http://www.gnu.org/licenses/gpl-2.0.html": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"},
     "https://www.gnu.org/licenses/gpl-3.0.html": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"},
     "https://www.gnu.org/licenses/gpl-3.0": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"},
     "http://www.gnu.org/licenses/gpl-3.0.html": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"},
     "https://www.gnu.org/licenses/agpl-3.0.html": {"version_key": "agpl-3.0", "name_key": "agpl-3", "family_key": "copyleft"},
     "https://www.gnu.org/licenses/agpl-3.0": {"version_key": "agpl-3.0", "name_key": "agpl-3", "family_key": "copyleft"},
     "https://www.gnu.org/licenses/lgpl-2.1.html": {"version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft"},
     "https://www.gnu.org/licenses/lgpl-2.1": {"version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft"},
     "https://www.gnu.org/licenses/lgpl-3.0.html": {"version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft"},
     "https://www.gnu.org/licenses/lgpl-3.0": {"version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft"},

     "https://opensource.org/licenses/MIT": {"version_key": "mit", "name_key": "mit", "family_key": "osi"},
     "https://www.apache.org/licenses/LICENSE-2.0": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"},
     "https://www.apache.org/licenses/LICENSE-2.0.html": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"},
     "https://opensource.org/licenses/Apache-2.0": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"},
     "https://opensource.org/licenses/BSD-2-Clause": {"version_key": "bsd-2-clause", "name_key": "bsd-2-clause", "family_key": "osi"},
     "https://opensource.org/licenses/BSD-3-Clause": {"version_key": "bsd-3-clause", "name_key": "bsd-3-clause", "family_key": "osi"},
     "https://opensource.org/licenses/ISC": {"version_key": "isc", "name_key": "isc", "family_key": "osi"},
     "https://www.mozilla.org/en-US/MPL/2.0/": {"version_key": "mpl-2.0", "name_key": "mpl", "family_key": "osi"},
     "https://www.mozilla.org/MPL/2.0/": {"version_key": "mpl-2.0", "name_key": "mpl", "family_key": "osi"},

     "https://www.elsevier.com/open-access/userlicense/1.0/": {"version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"},
     "http://www.elsevier.com/open-access/userlicense/1.0/": {"version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"},
     "https://www.elsevier.com/tdm/userlicense/1.0/": {"version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"},
     "http://www.elsevier.com/tdm/userlicense/1.0/": {"version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"},

     "http://doi.wiley.com/10.1002/tdm_license_1": {"version_key": "wiley-tdm", "name_key": "wiley-tdm", "family_key": "publisher-tdm"},
     "http://doi.wiley.com/10.1002/tdm_license_1.1": {"version_key": "wiley-tdm-1.1", "name_key": "wiley-tdm", "family_key": "publisher-tdm"},
     "http://onlinelibrary.wiley.com/termsAndConditions#vor": {"version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary"},
     "http://onlinelibrary.wiley.com/termsAndConditions#am": {"version_key": "wiley-am", "name_key": "wiley-am", "family_key": "publisher-proprietary"},
     "https://onlinelibrary.wiley.com/termsandconditions#vor": {"version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary"},
     "https://onlinelibrary.wiley.com/termsandconditions#am": {"version_key": "wiley-am", "name_key": "wiley-am", "family_key": "publisher-proprietary"},
     "https://onlinelibrary.wiley.com/termsandconditions": {"version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary"},
     "https://onlinelibrary.wiley.com/terms-and-conditions": {"version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary"},

     "https://www.springer.com/tdm": {"version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"},
     "http://www.springer.com/tdm": {"version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"},
     "https://www.springernature.com/gp/researchers/text-and-data-mining": {"version_key": "springernature-tdm", "name_key": "springernature-tdm", "family_key": "publisher-tdm"},

     "https://www.tandfonline.com/action/showCopyRight": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},
     "https://www.tandfonline.com/action/showcopyright": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},
     "https://tandfonline.com/action/showcopyright": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},
     "https://www.tandfonline.com/action/showcopyright?show=full": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},

     "https://us.sagepub.com/en-us/nam/journals-permissions": {"version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary"},
     "https://www.sagepub.com/journalspermissions.nav": {"version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary"},

     "https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": {"version_key": "acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa"},
     "https://pubs.acs.org/page/policy/authorchoice_ccbyncnd_termsofuse.html": {"version_key": "acs-authorchoice-ccbyncnd", "name_key": "acs-authorchoice-ccbyncnd", "family_key": "publisher-oa"},
     "https://pubs.acs.org/page/policy/authorchoice_termsofuse.html": {"version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"},
     "https://pubs.acs.org/page/policy/acs_authorchoice_with_nih_addendum_termsofuse.html": {"version_key": "acs-authorchoice-nih", "name_key": "acs-authorchoice-nih", "family_key": "publisher-oa"},
     "https://doi.org/10.1021/policy/oa-license": {"version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"},

     "https://www.rsc.org/journals-books-databases/journal-authors-reviewers/licences-copyright-permissions/": {"version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary"},
     "https://www.rsc.org/help/disclaimer/pages/term3.aspx": {"version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary"},

     "https://iopscience.iop.org/info/page/text-and-data-mining": {"version_key": "iop-tdm", "name_key": "iop-tdm", "family_key": "publisher-tdm"},
     "https://iopscience.iop.org/page/copyright": {"version_key": "iop-copyright", "name_key": "iop-copyright", "family_key": "publisher-proprietary"},

     "https://www.bmj.com/company/legal-stuff/copyright-notice/": {"version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary"},
     "https://group.bmj.com/group/rights-licensing/permissions": {"version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary"},

     "https://www.science.org/content/page/science-licenses-journal-article-reuse": {"version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary"},
     "https://www.sciencemag.org/about/science-licenses-journal-article-reuse": {"version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary"},

     "https://www.pnas.org/site/aboutpnas/licenses.xhtml": {"version_key": "pnas-licenses", "name_key": "pnas-licenses", "family_key": "publisher-proprietary"},

     "https://link.aps.org/licenses/aps-default-license": {"version_key": "aps-default", "name_key": "aps-default", "family_key": "publisher-proprietary"},
     "https://link.aps.org/licenses/aps-default-text-mining-license": {"version_key": "aps-tdm", "name_key": "aps-tdm", "family_key": "publisher-tdm"},

     "https://www.cambridge.org/core/terms": {"version_key": "cup-terms", "name_key": "cup-terms", "family_key": "publisher-proprietary"},

     "https://publishing.aip.org/authors/rights-and-permissions": {"version_key": "aip-rights", "name_key": "aip-rights", "family_key": "publisher-proprietary"},

     "https://jamanetwork.com/pages/cc-by-license-permissions": {"version_key": "jama-cc-by", "name_key": "jama-cc-by", "family_key": "publisher-oa"},

     "https://www.degruyter.com/dg/page/496": {"version_key": "degruyter-terms", "name_key": "degruyter-terms", "family_key": "publisher-proprietary"},

     "https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model": {"version_key": "oup-chorus", "name_key": "oup-chorus", "family_key": "publisher-oa"},
     "https://academic.oup.com/pages/standard-publication-reuse-rights": {"version_key": "oup-terms", "name_key": "oup-terms", "family_key": "publisher-proprietary"},

     "https://opendatacommons.org/licenses/odbl/1-0/": {"version_key": "odbl", "name_key": "odbl", "family_key": "open-data"},
     "https://opendatacommons.org/licenses/by/1-0/": {"version_key": "odc-by", "name_key": "odc-by", "family_key": "open-data"},
     "https://opendatacommons.org/licenses/pddl/1-0/": {"version_key": "pddl", "name_key": "pddl", "family_key": "open-data"}
   }


src/licence_normaliser/defaults.py
==================================

src/licence_normaliser/defaults.py

   """Default plugin configuration.

   These are the plugin CLASSES (not instances) that form the sane defaults.
   Pass them to LicenseNormaliser - they're instantiated lazily.
   """

   from __future__ import annotations

   from collections.abc import Mapping
   from typing import Iterator

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"

   __all__ = (
       "DEFAULT_PLUGINS",
       "DEFAULT_PLUGIN_KEYS",
       "get_all_refreshable_plugins",
   )

   DEFAULT_PLUGIN_KEYS = ("registry", "url", "alias", "family", "name", "prose")


   def get_all_refreshable_plugins() -> list[type]:
       """Return all plugin classes that support refresh (have url set)."""
       from .parsers.creativecommons import CreativeCommonsParser
       from .parsers.opendefinition import OpenDefinitionParser
       from .parsers.osi import OSIParser
       from .parsers.scancode_licensedb import ScanCodeLicenseDBParser
       from .parsers.spdx import SPDXParser

       return [
           SPDXParser,
           OpenDefinitionParser,
           OSIParser,
           ScanCodeLicenseDBParser,
           CreativeCommonsParser,
       ]


   def _load_registry_plugins() -> list[type]:
       from .parsers.creativecommons import CreativeCommonsParser
       from .parsers.opendefinition import OpenDefinitionParser
       from .parsers.osi import OSIParser
       from .parsers.scancode_licensedb import ScanCodeLicenseDBParser
       from .parsers.spdx import SPDXParser

       return [
           SPDXParser,
           OpenDefinitionParser,
           OSIParser,
           ScanCodeLicenseDBParser,
           CreativeCommonsParser,
       ]


   def _load_url_plugins() -> list[type]:
       from .parsers.creativecommons import CreativeCommonsParser
       from .parsers.opendefinition import OpenDefinitionParser
       from .parsers.osi import OSIParser
       from .parsers.publisher import PublisherParser
       from .parsers.spdx import SPDXParser

       return [
           SPDXParser,
           OpenDefinitionParser,
           OSIParser,
           CreativeCommonsParser,
           PublisherParser,
       ]


   def _load_alias_plugins() -> list[type]:
       from .parsers.alias import AliasParser
       from .parsers.publisher import PublisherParser

       # PublisherParser first, then AliasParser - AliasParser values take precedence
       return [PublisherParser, AliasParser]


   def _load_family_plugins() -> list[type]:
       from .parsers.alias import AliasParser

       return [AliasParser]


   def _load_name_plugins() -> list[type]:
       from .parsers.alias import AliasParser

       return [AliasParser]


   def _load_prose_plugins() -> list[type]:
       from .parsers.prose import ProseParser

       return [ProseParser]


   # Lazy-loaded bundle - functions delay imports until actually needed
   class _LazyDefaults:
       """Lazy-loading container for default plugins."""

       _registry: list[type] | None = None
       _url: list[type] | None = None
       _alias: list[type] | None = None
       _family: list[type] | None = None
       _name: list[type] | None = None
       _prose: list[type] | None = None

       @property
       def registry(self) -> list[type]:
           if self._registry is None:
               self._registry = _load_registry_plugins()
           return self._registry

       @property
       def url(self) -> list[type]:
           if self._url is None:
               self._url = _load_url_plugins()
           return self._url

       @property
       def alias(self) -> list[type]:
           if self._alias is None:
               self._alias = _load_alias_plugins()
           return self._alias

       @property
       def family(self) -> list[type]:
           if self._family is None:
               self._family = _load_family_plugins()
           return self._family

       @property
       def name(self) -> list[type]:
           if self._name is None:
               self._name = _load_name_plugins()
           return self._name

       @property
       def prose(self) -> list[type]:
           if self._prose is None:
               self._prose = _load_prose_plugins()
           return self._prose


   _LAZY = _LazyDefaults()


   # Convenience accessors - these trigger lazy loading
   def get_default_registry() -> list[type]:
       return _LAZY.registry


   def get_default_url() -> list[type]:
       return _LAZY.url


   def get_default_alias() -> list[type]:
       return _LAZY.alias


   def get_default_family() -> list[type]:
       return _LAZY.family


   def get_default_name() -> list[type]:
       return _LAZY.name


   def get_default_prose() -> list[type]:
       return _LAZY.prose


   class _LazyPluginsBundle:
       """Lazy-loading bundle - defers plugin loading until accessed."""

       _cache: dict[str, list[type]] = {}

       def _get_registry(self) -> list[type]:
           return get_default_registry()

       def _get_url(self) -> list[type]:
           return get_default_url()

       def _get_alias(self) -> list[type]:
           return get_default_alias()

       def _get_family(self) -> list[type]:
           return get_default_family()

       def _get_name(self) -> list[type]:
           return get_default_name()

       def _get_prose(self) -> list[type]:
           return get_default_prose()

       def __getitem__(self, key: str) -> list[type]:
           if key not in self._cache:
               fn = getattr(self, f"_get_{key}", None)
               if fn is None:
                   raise KeyError(key)
               self._cache[key] = fn()
           return self._cache[key]


   _DEFAULT_PLUGINS_BUNDLE = _LazyPluginsBundle()


   class _DefaultPlugins(Mapping):
       """Lazy dict-like accessor for default plugins."""

       def __getitem__(self, key: str) -> list[type]:
           return _DEFAULT_PLUGINS_BUNDLE[key]

       def keys(self) -> tuple[str, ...]:
           return DEFAULT_PLUGIN_KEYS

       def values(self) -> list[list[type]]:
           return [self[k] for k in self.keys()]

       def items(self) -> list[tuple[str, list[type]]]:
           return [(k, self[k]) for k in self.keys()]

       def __iter__(self) -> Iterator[str]:
           return iter(self.keys())

       def __len__(self) -> int:
           return len(DEFAULT_PLUGIN_KEYS)

       def __contains__(self, key: str) -> bool:
           return key in self.keys()

       def copy(self) -> dict:
           return dict(self.items())


   DEFAULT_PLUGINS = _DefaultPlugins()
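
The lazy-loading idea behind ``DEFAULT_PLUGINS`` can be tried out in isolation. The following is a minimal standalone sketch, not the package's implementation: ``LazyMapping``, ``load_alias`` and ``load_prose`` are hypothetical names, and plain strings stand in for the plugin classes. It only shows that a loader runs on first access and that its result is cached afterwards.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Callable, Iterator


class LazyMapping(Mapping):
    """Hypothetical stand-in for _DefaultPlugins: values are built on first access."""

    def __init__(self, loaders: dict[str, Callable[[], list[str]]]) -> None:
        self._loaders = loaders
        self._cache: dict[str, list[str]] = {}

    def __getitem__(self, key: str) -> list[str]:
        if key not in self._cache:
            # An unknown key raises KeyError here, before anything is cached.
            self._cache[key] = self._loaders[key]()
        return self._cache[key]

    def __iter__(self) -> Iterator[str]:
        # Iterating over keys does not trigger any loader.
        return iter(self._loaders)

    def __len__(self) -> int:
        return len(self._loaders)


calls: list[str] = []  # records which loaders have actually run


def load_alias() -> list[str]:
    calls.append("alias")
    return ["PublisherParser", "AliasParser"]


def load_prose() -> list[str]:
    calls.append("prose")
    return ["ProseParser"]


defaults = LazyMapping({"alias": load_alias, "prose": load_prose})
first = defaults["alias"]   # runs load_alias once
second = defaults["alias"]  # served from the cache
```

Note that the ``in`` operator inherited from ``Mapping`` calls ``__getitem__`` and would therefore run a loader in this sketch; overriding ``__contains__`` to check the key tuple, as ``_DefaultPlugins`` does, avoids that.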


src/licence_normaliser/exceptions.py
====================================

src/licence_normaliser/exceptions.py

   """licence_normaliser.exceptions - public exception hierarchy.

   These are the only exceptions that cross the public API boundary.
   All internal errors are wrapped before propagation.
   """

   from __future__ import annotations

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = (
       "DataSourceError",
       "LicenseNormalisationError",
       "LicenseNormaliserError",
       "LicenseNotFoundError",
   )


   class LicenseNormaliserError(Exception):
       """Base exception for all licence-normaliser errors."""


   class LicenseNotFoundError(LicenseNormaliserError):
       """Raised in strict mode when a license string cannot be resolved."""

       def __init__(self, raw: str, cleaned: str) -> None:
           self.raw = raw
           self.cleaned = cleaned
           super().__init__(
               f"License not found: {raw!r} (cleaned: {cleaned!r}). "
               "Pass strict=False to return an 'unknown' result instead."
           )


   class DataSourceError(LicenseNormaliserError):
       """Raised when a data source file cannot be loaded or parsed."""


   class LicenseNormalisationError(ValueError):
       """Raised when ``strict=True`` and no canonical license could be resolved."""
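
A short standalone sketch of how callers might handle these exceptions. The three classes are re-declared here (in shortened form) so the snippet runs on its own; ``resolve`` is a hypothetical function that exists only to raise the error and is not part of the package.

```python
class LicenseNormaliserError(Exception):
    """Base exception (mirrors the listing above)."""


class LicenseNotFoundError(LicenseNormaliserError):
    def __init__(self, raw: str, cleaned: str) -> None:
        self.raw = raw
        self.cleaned = cleaned
        super().__init__(f"License not found: {raw!r} (cleaned: {cleaned!r}).")


class LicenseNormalisationError(ValueError):
    """Derives from ValueError, not from LicenseNormaliserError."""


def resolve(raw: str, strict: bool = True) -> str:
    """Hypothetical resolver used only to demonstrate the error path."""
    known = {"mit": "mit"}
    cleaned = raw.strip().lower()
    if cleaned in known:
        return known[cleaned]
    if strict:
        raise LicenseNotFoundError(raw, cleaned)
    return "unknown"


caught = None
try:
    resolve("  Foo License ")
except LicenseNormaliserError as exc:
    # Catching the base class also catches LicenseNotFoundError.
    caught = exc
```

As declared in the listing, ``LicenseNormalisationError`` is a ``ValueError`` and sits outside the ``LicenseNormaliserError`` hierarchy, so an ``except LicenseNormaliserError`` clause will not catch it.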


src/licence_normaliser/parsers/__init__.py
==========================================

src/licence_normaliser/parsers/__init__.py


src/licence_normaliser/parsers/alias.py
=======================================

src/licence_normaliser/parsers/alias.py

   """Alias parser - loads aliases.json with rich metadata for aliases/family overrides.

   Each entry may carry an optional ``aliases`` list of extra lookup keys that all
   resolve to the same ``version_key``.  This lets data authors enumerate explicit
   variants (e.g. hyphen vs space forms) without any auto-generation magic::

       "cc by-nc": {
           "version_key": "cc-by-nc",
           "name_key": "cc-by-nc",
           "family_key": "cc",
           "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"]
       }

   All keys in ``aliases`` inherit the same ``version_key``, ``name_key``, and
   ``family_key`` as the primary entry.
   """

   from __future__ import annotations

   import json
   from pathlib import Path
   from typing import Any

   from licence_normaliser.plugins import AliasPlugin, BasePlugin, FamilyPlugin, NamePlugin

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = ("AliasParser",)


   def _iter_entries(
       data: dict[str, Any],
   ) -> list[tuple[str, dict[str, Any]]]:
       """Return (key, meta) pairs, expanding ``aliases`` sub-keys.

       For every primary entry that has an ``"aliases"`` list, each alias key is
       emitted as an additional entry with the same metadata dict (minus the
       ``aliases`` field itself, to keep things tidy).
       """
       results: list[tuple[str, dict[str, Any]]] = []
       for primary_key, meta in data.items():
           if primary_key.startswith("_"):
               continue
           if not isinstance(meta, dict):
               continue
           version_key = meta.get("version_key", "")
           if not version_key:
               continue
           results.append((primary_key, meta))

           # Expand explicit alias variants
           for extra_key in meta.get("aliases", []):
               if not isinstance(extra_key, str) or not extra_key:
                   continue
               if extra_key == primary_key:
                   continue  # already emitted
               # Build a slim copy without the aliases list to avoid recursion
               slim_meta = {k: v for k, v in meta.items() if k != "aliases"}
               results.append((extra_key, slim_meta))

       return results


   class AliasParser(BasePlugin, AliasPlugin, FamilyPlugin, NamePlugin):
       url = None
       local_path = "data/aliases/aliases.json"

       def _load_data(self) -> dict[str, Any]:
           path = Path(__file__).parent.parent / self.local_path
           return json.loads(path.read_text(encoding="utf-8"))

       def parse(self) -> list[tuple[str, dict[str, Any]]]:
           return _iter_entries(self._load_data())

       def load_aliases(self) -> dict[str, str]:
           aliases: dict[str, str] = {}
           for alias_key, meta in _iter_entries(self._load_data()):
               version_key = meta.get("version_key", "")
               if version_key:
                   aliases[alias_key] = version_key
           return aliases

       def load_aliases_with_lines(
           self,
       ) -> dict[str, tuple[str, int]]:
           """Load aliases with their source line numbers.

           Extra keys from ``aliases`` lists are reported at the line of their
           primary entry (best approximation without per-alias line tracking).

           Returns:
               dict mapping alias_key -> (version_key, line_number)
           """
           path = Path(__file__).parent.parent / self.local_path
           content = path.read_text(encoding="utf-8")
           data: dict[str, Any] = json.loads(content)
           lines = content.splitlines()
           result: dict[str, tuple[str, int]] = {}

           for primary_key, meta in data.items():
               if primary_key.startswith("_"):
                   continue
               if not isinstance(meta, dict):
                   continue
               version_key = meta.get("version_key", "")
               if not version_key:
                   continue

               # Find line of the primary key
               primary_line = 1
               for i, line in enumerate(lines, start=1):
                   if f'"{primary_key}"' in line:
                       primary_line = i
                       break

               result[primary_key] = (version_key, primary_line)

               for extra_key in meta.get("aliases", []):
                   if not isinstance(extra_key, str) or not extra_key:
                       continue
                   if extra_key == primary_key:
                       continue
                   result[extra_key] = (version_key, primary_line)

           return result

       def load_families(self) -> dict[str, str]:
           data = self._load_data()
           overrides: dict[str, str] = {}
           for meta in data.values():
               if not isinstance(meta, dict):
                   continue
               vk = meta.get("version_key", "")
               fk = meta.get("family_key", "")
               if vk and fk:
                   overrides[vk] = fk
           return overrides

       def load_names(self) -> dict[str, str]:
           data = self._load_data()
           names: dict[str, str] = {}
           for meta in data.values():
               if not isinstance(meta, dict):
                   continue
               vk = meta.get("version_key", "")
               nk = meta.get("name_key", "")
               if vk and nk:
                   names[vk] = nk
           return names
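
The alias expansion described in the module docstring can be illustrated with a standalone sketch. ``expand_aliases`` is a hypothetical, simplified equivalent of ``load_aliases()`` that works on an in-memory dict instead of reading ``aliases.json``; the sample data reuses the entry from the docstring plus an assumed underscore-prefixed key.

```python
from __future__ import annotations

from typing import Any


def expand_aliases(data: dict[str, Any]) -> dict[str, str]:
    """Map every primary key and every extra alias to its version_key."""
    aliases: dict[str, str] = {}
    for primary_key, meta in data.items():
        # Underscore-prefixed keys and non-dict values are skipped.
        if primary_key.startswith("_") or not isinstance(meta, dict):
            continue
        version_key = meta.get("version_key", "")
        if not version_key:
            continue
        aliases[primary_key] = version_key
        for extra_key in meta.get("aliases", []):
            if isinstance(extra_key, str) and extra_key:
                aliases[extra_key] = version_key
    return aliases


sample = {
    "_comment": "keys starting with an underscore are skipped",
    "cc by-nc": {
        "version_key": "cc-by-nc",
        "name_key": "cc-by-nc",
        "family_key": "cc",
        "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"],
    },
}
result = expand_aliases(sample)
```

All four spellings in ``result`` resolve to ``cc-by-nc``, and the ``_comment`` key does not appear.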


src/licence_normaliser/parsers/creativecommons.py
=================================================

src/licence_normaliser/parsers/creativecommons.py

   """Creative Commons parser - scrapes creativecommons.org for multilingual deed URLs."""

   from __future__ import annotations

   import json
   import re
   import urllib.request
   from html.parser import HTMLParser
   from pathlib import Path
   from typing import Any

   from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin

   CC_LICENSE_RE = re.compile(
       r"^(by|by-nc|by-nc-nd|by-nc-sa|by-nd|by-sa|"
       r"zero|pdmark|devnations|"
       r"nc|nd|sa|sampling|nc-sa|sampling\+|nc-sampling\+|nd-nc)"
       r"/([\d.]+)"
       r"(/igo)?"
       r"(/deed\.\w+)?$",
   )
   VERSION_RE = re.compile(r"^[\d.]+$")


   def _path_to_license_key(path: str) -> str | None:
       m = CC_LICENSE_RE.match(path)
       if not m:
           return None
       lic_type, version, igo = m.group(1), m.group(2), m.group(3)

       prefix_map = {
           "by": "cc-by",
           "by-nc": "cc-by-nc",
           "by-nc-nd": "cc-by-nc-nd",
           "by-nc-sa": "cc-by-nc-sa",
           "by-nd": "cc-by-nd",
           "by-sa": "cc-by-sa",
           "zero": "cc0",
           "pdmark": "cc-pdm",
           "devnations": "cc-devnations",
           "nc": "cc-nc",
           "nd": "cc-nd",
           "sa": "cc-sa",
           "sampling": "cc-sampling",
           "nc-sa": "cc-nc-sa",
           "sampling+": "cc-sampling-plus",
           "nc-sampling+": "cc-nc-sampling-plus",
           "nd-nc": "cc-nd-nc",
       }
       prefix = prefix_map.get(lic_type)
       if not prefix:
           return None
       suffix = "igo" if igo else ""
       key = f"{prefix}-{version}" if VERSION_RE.match(version) else prefix
       if suffix:
           key = f"{key}-{suffix}"
       return key.lower()


   class CCLinkParser(HTMLParser):
       def __init__(self) -> None:
           super().__init__()
           self.in_td = False
           self.current_cell = ""
           self.current_row: list[str] = []
           self.rows: list[list[str]] = []

       def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
           if tag == "td":
               self.in_td = True
               self.current_cell = ""
           elif tag == "a" and self.in_td:
               href = dict(attrs).get("href") or ""
               if href:
                   self.current_cell += " AHREF:" + href

       def handle_endtag(self, tag: str) -> None:
           if tag == "td":
               self.in_td = False
               self.current_row.append(self.current_cell.strip())
           elif tag == "tr":
               if self.current_row:
                   self.rows.append(self.current_row)
               self.current_row = []

       def handle_data(self, data: str) -> None:
           if self.in_td:
               self.current_cell += data


   def _fetch_html(url: str) -> str:
       req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
       with urllib.request.urlopen(req, timeout=30) as response:  # noqa: S310
           return response.read().decode("utf-8")


   JURISDICTION_CODES = {
       "au",
       "at",
       "be",
       "br",
       "ca",
       "ch",
       "cl",
       "cn",
       "co",
       "cz",
       "de",
       "dk",
       "ee",
       "eg",
       "es",
       "fi",
       "fr",
       "gb",
       "gr",
       "hr",
       "hu",
       "id",
       "ie",
       "il",
       "in",
       "ir",
       "is",
       "it",
       "jp",
       "kr",
       "lt",
       "lu",
       "lv",
       "ma",
       "mt",
       "mx",
       "my",
       "nl",
       "no",
       "nz",
       "pe",
       "ph",
       "pl",
       "pt",
       "ro",
       "rs",
       "ru",
       "se",
       "si",
       "sk",
       "th",
       "tr",
       "tw",
       "ua",
       "ug",
       "us",
       "za",
       "vn",
   }


   def _is_international(href: str) -> bool:
       parts = href.split("/")
       return not any(p in JURISDICTION_CODES for p in parts[1:])


   def _extract_deeds(html: str) -> set[str]:
       parser = CCLinkParser()
       parser.feed(html)
       deeds: set[str] = set()
       for row in parser.rows:
           if not row:
               continue
           jurisdiction = row[0]
           if jurisdiction != "English":
               continue
           for cell in row[1:]:
               for part in cell.split():
                   if part.startswith("AHREF:"):
                       href = part[6:]
                       if href and _is_international(href):
                           deeds.add(href)
       return deeds


   def _scrape() -> list[dict[str, str]]:
       pages = [
           "https://creativecommons.org/licenses/list.en",
           "https://creativecommons.org/publicdomain/list.en",
       ]
       all_deeds: set[str] = set()
       try:
           for page_url in pages:
               html = _fetch_html(page_url)
               all_deeds |= _extract_deeds(html)
       except Exception:
           pass

       entries: list[dict[str, str]] = []
       seen_keys: set[str] = set()
       for href in sorted(all_deeds):
           lic_key = _path_to_license_key(href)
           if not lic_key:
               continue
           url_path = href.rsplit("/deed.", 1)[0]
           url = f"https://creativecommons.org/licenses/{url_path}/"
           if lic_key in seen_keys:
               continue
           seen_keys.add(lic_key)
           entries.append({"license_key": lic_key, "url": url, "path": url_path})

       return entries


   class CreativeCommonsParser(BasePlugin, RegistryPlugin, URLPlugin):
       id = "creativecommons"
       url = "https://creativecommons.org/licenses/list.en"
       local_path = "data/creativecommons/creativecommons.json"

       def parse(self) -> list[tuple[str, dict[str, Any]]]:
           path = Path(__file__).parent.parent / self.local_path
           if not path.exists():
               return []
           data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
           return [
               (
                   entry["license_key"],
                   {
                       "url": entry["url"],
                       "name": entry["license_key"],
                       "path": entry["path"],
                   },
               )
               for entry in data
               if "license_key" in entry
           ]

       def load_registry(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           if not path.exists():
               return {}
           data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
           result: dict[str, str] = {}
           for entry in data:
               key = entry.get("license_key", "")
               if key:
                   result[key.lower().strip()] = key.lower().strip()
           return result

       def load_urls(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           if not path.exists():
               return {}
           data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
           result: dict[str, str] = {}
           for entry in data:
               key = entry.get("license_key", "")
               if not key:
                   continue
               canonical = key.lower().strip()
               raw_url = entry.get("url", "")
               if not raw_url:
                   continue
               clean = raw_url.strip().lower().rstrip("/")
               if clean.startswith("http://"):
                   clean = "https://" + clean[7:]
               result[clean] = canonical
           return result

       @classmethod
       def refresh(cls, force: bool = False) -> bool:
           target = Path(__file__).parent.parent / cls.local_path
           if target.exists() and not force:
               return True
           try:
               data = _scrape()
               target.parent.mkdir(parents=True, exist_ok=True)
               target.write_text(
                   json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
               )
               return True
           except Exception:
               return False
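
To make the path-to-key mapping easier to follow, here is a reduced standalone sketch. ``PATH_RE``, ``PREFIXES`` and ``path_to_key`` are hypothetical names covering only four licence types; the full pattern and prefix table are the ones in ``CC_LICENSE_RE`` and ``_path_to_license_key`` above. The relative deed paths used as input are assumed examples of what the list pages link to.

```python
from __future__ import annotations

import re

# Reduced, illustrative subset of CC_LICENSE_RE and prefix_map.
PATH_RE = re.compile(r"^(by|by-nc|by-sa|zero)/([\d.]+)(/igo)?(/deed\.\w+)?$")
PREFIXES = {"by": "cc-by", "by-nc": "cc-by-nc", "by-sa": "cc-by-sa", "zero": "cc0"}


def path_to_key(path: str) -> str | None:
    """Turn a relative deed path into a versioned licence key, or None."""
    m = PATH_RE.match(path)
    if not m:
        return None
    lic_type, version, igo = m.group(1), m.group(2), m.group(3)
    key = f"{PREFIXES[lic_type]}-{version}"
    if igo:
        key += "-igo"
    return key


international = path_to_key("by/4.0/deed.en")
intergovernmental = path_to_key("by/3.0/igo/deed.en")
public_domain = path_to_key("zero/1.0/deed.en")
ported = path_to_key("by/3.0/de/deed.de")  # jurisdiction segment: no match
```

Paths with a jurisdiction segment do not match the pattern, which complements the ``_is_international`` filter applied before the key is derived.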


src/licence_normaliser/parsers/opendefinition.py
================================================

src/licence_normaliser/parsers/opendefinition.py

   """OpenDefinition parser - loads opendefinition.json from package data."""

   from __future__ import annotations

   import json
   from pathlib import Path
   from typing import Any

   from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = ("OpenDefinitionParser",)


   class OpenDefinitionParser(BasePlugin, RegistryPlugin, URLPlugin):
       id = "opendefinition"
       url = "https://licenses.opendefinition.org/licenses/groups/all.json"
       local_path = "data/opendefinition/opendefinition.json"

       def parse(self) -> list[tuple[str, dict[str, Any]]]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           results: list[tuple[str, dict[str, Any]]] = []
           for entry in data.values():
               if not isinstance(entry, dict):
                   continue
               lid = entry.get("id", "")
               if not lid:
                   # Skip entries without an id, as load_registry() does
                   continue
               url = entry.get("url", "")
               results.append((lid, {"url": url, "title": entry.get("title", "")}))
           return results

       def load_registry(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           result: dict[str, str] = {}
           for entry in data.values():
               if not isinstance(entry, dict):
                   continue
               lid = entry.get("id", "")
               if lid:
                   result[lid.lower().strip()] = lid.lower().strip()
           return result

       def load_urls(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           result: dict[str, str] = {}
           for entry in data.values():
               if not isinstance(entry, dict):
                   continue
               lid = entry.get("id", "")
               if not lid:
                   continue
               canonical = lid.lower().strip()
               raw_url = entry.get("url", "")
               if not raw_url:
                   continue
               clean = raw_url.strip().lower().rstrip("/")
               if clean.startswith("http://"):
                   clean = "https://" + clean[7:]
               result[clean] = canonical
           return result
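
The ``load_urls`` methods in these parsers apply the same clean-up steps to each URL before using it as a lookup key. The sketch below isolates those steps in a single function; ``normalise_url`` is a hypothetical helper name and does not exist in the package, and the input URLs are made-up examples.

```python
def normalise_url(raw_url: str) -> str:
    """Apply the clean-up used for URL lookup keys in load_urls()."""
    # Trim whitespace, lower-case, and drop trailing slashes.
    clean = raw_url.strip().lower().rstrip("/")
    # Treat http:// and https:// variants as the same key.
    if clean.startswith("http://"):
        clean = "https://" + clean[7:]
    return clean


a = normalise_url("  HTTP://Example.org/Licenses/X/ ")
b = normalise_url("https://example.org/licenses/x")
```

Both calls yield the same key, so either spelling of the URL resolves to the same canonical licence id.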


src/licence_normaliser/parsers/osi.py
=====================================

src/licence_normaliser/parsers/osi.py

   """OSI parser - loads osi.json from package data."""

   from __future__ import annotations

   import json
   from pathlib import Path
   from typing import Any

   from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = ("OSIParser",)


   class OSIParser(BasePlugin, RegistryPlugin, URLPlugin):
       id = "osi"
       url = "https://opensource.org/api/license"
       local_path = "data/osi/osi.json"

       def parse(self) -> list[tuple[str, dict[str, Any]]]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           results: list[tuple[str, dict[str, Any]]] = []
           if not isinstance(data, list):
               return results
           for entry in data:
               if not isinstance(entry, dict):
                   continue
               key = entry.get("id", "")
               if not key:
                   continue
               links = entry.get("_links", {})
               html_link = links.get("html", {})
               url = html_link.get("href", "") if isinstance(html_link, dict) else ""
               results.append(
                   (
                       key,
                       {
                           "url": url,
                           "name": entry.get("name", ""),
                           "spdx_id": entry.get("spdx_id", ""),
                       },
                   )
               )
           return results

       def load_registry(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           result: dict[str, str] = {}
           if not isinstance(data, list):
               return result
           for entry in data:
               if not isinstance(entry, dict):
                   continue
               key = entry.get("id", "").strip()
               if key:
                   result[key.lower()] = key.lower()
           return result

       def load_urls(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           result: dict[str, str] = {}
           if not isinstance(data, list):
               return result
           for entry in data:
               if not isinstance(entry, dict):
                   continue
               key = entry.get("id", "").strip()
               if not key:
                   continue
               canonical = key.lower()
               links = entry.get("_links", {})
               html_link = links.get("html", {})
               raw_url = html_link.get("href", "") if isinstance(html_link, dict) else ""
               if not raw_url:
                   continue
               clean = raw_url.strip().lower().rstrip("/")
               if clean.startswith("http://"):
                   clean = "https://" + clean[7:]
               result[clean] = canonical
           return result


src/licence_normaliser/parsers/prose.py
=======================================

src/licence_normaliser/parsers/prose.py

   """Prose pattern parser - loads prose_patterns.json and compiles regex patterns."""

   from __future__ import annotations

   import json
   import re
   from pathlib import Path
   from typing import Any

   from licence_normaliser.plugins import BasePlugin, ProsePlugin

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = ("ProseParser",)

   _COMPILED_PATTERNS: list[tuple[re.Pattern[str], str]] = []


   class ProseParser(BasePlugin, ProsePlugin):
       is_registry_entry = False
       url = None
       local_path = "data/prose/prose_patterns.json"

       def parse(self) -> list[tuple[str, dict[str, Any]]]:
           path = Path(__file__).parent.parent / self.local_path
           data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
           global _COMPILED_PATTERNS
           _COMPILED_PATTERNS = []
           results: list[tuple[str, dict[str, Any]]] = []
           for entry in data:
               pattern_str = entry.get("pattern", "")
               version_key = entry.get("version_key", "")
               name_key = entry.get("name_key", "")
               family_key = entry.get("family_key", "")
               if pattern_str and version_key:
                   compiled = re.compile(pattern_str, re.IGNORECASE)
                   _COMPILED_PATTERNS.append((compiled, version_key))
                   results.append(
                       (
                           pattern_str,
                           {
                               "pattern": compiled,
                               "version_key": version_key,
                               "name_key": name_key,
                               "family_key": family_key,
                           },
                       )
                   )
           return results

       def load_prose(self) -> list[tuple[re.Pattern[str], str]]:
           global _COMPILED_PATTERNS
           _COMPILED_PATTERNS = []
           path = Path(__file__).parent.parent / self.local_path
           data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
           for entry in data:
               pattern_str = entry.get("pattern", "")
               version_key = entry.get("version_key", "")
               if pattern_str and version_key:
                   compiled = re.compile(pattern_str, re.IGNORECASE)
                   _COMPILED_PATTERNS.append((compiled, version_key))
           return _COMPILED_PATTERNS

       def load_prose_with_lines(self) -> list[tuple[re.Pattern[str], str, int]]:
           """Load prose patterns with their source line numbers.

           Returns:
               list of (compiled_pattern, version_key, line_number)
           """
           path = Path(__file__).parent.parent / self.local_path
           content = path.read_text(encoding="utf-8")
           data: list[dict[str, str]] = json.loads(content)
           lines = content.splitlines()
           result: list[tuple[re.Pattern[str], str, int]] = []
           for entry in data:
               pattern_str = entry.get("pattern", "")
               version_key = entry.get("version_key", "")
               if pattern_str and version_key:
                   compiled = re.compile(pattern_str, re.IGNORECASE)
                   serialized = json.dumps(pattern_str)
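                   # Locate the first line holding this pattern by matching a prefix
                   # of its JSON-encoded form; fall back to line 1 if none matches.
                   line_num = 1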
                   for i, line in enumerate(lines, start=1):
                       if '"pattern"' in line and serialized[:30] in line:
                           line_num = i
                           break
                   result.append((compiled, version_key, line_num))
           return result


   def get_prose_patterns() -> list[tuple[re.Pattern[str], str]]:
       """Legacy helper: return the compiled prose patterns."""
       return _COMPILED_PATTERNS
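
The ``(compiled_pattern, version_key)`` pairs returned by ``load_prose()`` could be consumed roughly as sketched below. This is an illustration only: the two patterns and the ``match_prose`` helper are hypothetical, and the first-match-wins rule is an assumption, since the code that consumes these pairs is not shown in this file.

```python
import re

# Hypothetical patterns for illustration; the real ones are loaded from
# data/prose/prose_patterns.json and are not reproduced here.
PATTERNS = [
    (re.compile(r"creative commons attribution 4\.0", re.IGNORECASE), "cc-by-4.0"),
    (re.compile(r"\bmit licen[cs]e\b", re.IGNORECASE), "mit"),
]


def match_prose(text, patterns):
    """Return the version_key of the first pattern found in text, else None."""
    for pattern, version_key in patterns:
        # search() scans the whole string, so the phrase may appear anywhere.
        if pattern.search(text):
            return version_key
    return None
```

Because the list is scanned in order, more specific patterns would need to precede more general ones under this assumption.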


src/licence_normaliser/parsers/publisher.py
===========================================

src/licence_normaliser/parsers/publisher.py

   """Publisher parser - loads publishers.json with URLs and shorthand aliases."""

   from __future__ import annotations

   import json
   from pathlib import Path
   from typing import Any

   from licence_normaliser.plugins import AliasPlugin, BasePlugin, URLPlugin

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = ("PublisherParser",)


   class PublisherParser(BasePlugin, AliasPlugin, URLPlugin):
       url = None
       local_path = "data/publishers/publishers.json"

       def parse(self) -> list[tuple[str, dict[str, Any]]]:
           path = Path(__file__).parent.parent / self.local_path
           data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8"))
           results: list[tuple[str, dict[str, Any]]] = []
           urls: dict[str, dict[str, str]] = data.get("urls", {})
           for url, meta in urls.items():
               if isinstance(meta, dict):
                   results.append((url, meta))
           return results

       def load_aliases(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8"))
           aliases: dict[str, str] = data.get("shorthand_aliases", {})
           return dict(aliases)

       def load_aliases_with_lines(self) -> dict[str, tuple[str, int]]:
           """Load shorthand aliases with their source line numbers."""
           path = Path(__file__).parent.parent / self.local_path
           content = path.read_text(encoding="utf-8")
           data: dict[str, Any] = json.loads(content)
           lines = content.splitlines()
           result: dict[str, tuple[str, int]] = {}
           for alias_key, version_key in data.get("shorthand_aliases", {}).items():
               for i, line in enumerate(lines, start=1):
                   if f'"{alias_key}"' in line:
                       result[alias_key] = (version_key, i)
                       break
           return result

       def load_urls(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8"))
           result: dict[str, str] = {}
           for url, meta in data.get("urls", {}).items():
               if not isinstance(meta, dict):
                   continue
               vk = meta.get("version_key", "")
               if not vk:
                   continue
               clean = url.strip().lower().rstrip("/")
               if clean.startswith("http://"):
                   clean = "https://" + clean[7:]
               result[clean] = vk
           return result

       def load_urls_with_lines(self) -> dict[str, tuple[str, int]]:
           """Load URLs with their source line numbers."""
           path = Path(__file__).parent.parent / self.local_path
           content = path.read_text(encoding="utf-8")
           data: dict[str, Any] = json.loads(content)
           lines = content.splitlines()
           result: dict[str, tuple[str, int]] = {}
           for url, meta in data.get("urls", {}).items():
               if not isinstance(meta, dict):
                   continue
               vk = meta.get("version_key", "")
               if not vk:
                   continue
               clean = url.strip().lower().rstrip("/")
               if clean.startswith("http://"):
                   clean = "https://" + clean[7:]
               for i, line in enumerate(lines, start=1):
                   if f'"{url}"' in line or f'"{clean}"' in line:
                       result[clean] = (vk, i)
                       break
           return result
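
The ``*_with_lines`` methods above share one technique: parse the JSON for its values, then scan the raw text for the first line containing the quoted key. A minimal standalone sketch follows; ``keys_with_lines`` and the sample data are hypothetical and not part of the package.

```python
import json


def keys_with_lines(content: str, section: str) -> dict[str, tuple[str, int]]:
    """Map each key in ``section`` to (value, 1-based line number of the key)."""
    data = json.loads(content)
    lines = content.splitlines()
    result: dict[str, tuple[str, int]] = {}
    for key, value in data.get(section, {}).items():
        needle = f'"{key}"'
        for number, line in enumerate(lines, start=1):
            # The first line containing the quoted key wins.
            if needle in line:
                result[key] = (value, number)
                break
    return result


# Hypothetical sample input, laid out one key per line.
SAMPLE = """{
  "shorthand_aliases": {
    "oa": "open-access",
    "pd": "public-domain"
  }
}"""
```

The lookup is textual, so a key that also appears as a quoted value on an earlier line would be reported at that earlier line, and a key whose JSON encoding contains escapes would not be found at all.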


src/licence_normaliser/parsers/scancode_licensedb.py
====================================================

src/licence_normaliser/parsers/scancode_licensedb.py

   """ScanCode-licensedb parser - loads scancode_licensedb.json from package data."""

   from __future__ import annotations

   import json
   from pathlib import Path
   from typing import Any

   from licence_normaliser.plugins import BasePlugin, RegistryPlugin

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = ("ScanCodeLicenseDBParser",)


   class ScanCodeLicenseDBParser(BasePlugin, RegistryPlugin):
       id = "scancode-licensedb"
       url = "https://scancode-licensedb.aboutcode.org/index.json"
       local_path = "data/scancode_licensedb/scancode_licensedb.json"

       def parse(self) -> list[tuple[str, dict[str, Any]]]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           results: list[tuple[str, dict[str, Any]]] = []
           if not isinstance(data, list):
               return results
           for entry in data:
               if not isinstance(entry, dict):
                   continue
               key = entry.get("license_key", "")
               if not key:
                   continue
               if key.lower() == "unknown":
                   continue
               spdx_key = entry.get("spdx_license_key")
               category = entry.get("category", "")
               results.append(
                   (
                       key,
                       {
                           "url": "",
                           "name": key,
                           "category": category,
                           "spdx_license_key": spdx_key if spdx_key else "",
                       },
                   )
               )
           return results

       def load_registry(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           result: dict[str, str] = {}
           if not isinstance(data, list):
               return result
           for entry in data:
               if not isinstance(entry, dict):
                   continue
               key = entry.get("license_key", "")
               if key and key.lower() != "unknown":
                   result[key.lower().strip()] = key.lower().strip()
           return result


src/licence_normaliser/parsers/spdx.py
======================================

src/licence_normaliser/parsers/spdx.py

   """SPDX parser - loads spdx-licenses.json from package data."""

   from __future__ import annotations

   import json
   from pathlib import Path
   from typing import Any

   from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"
   __all__ = ("SPDXParser",)


   class SPDXParser(BasePlugin, RegistryPlugin, URLPlugin):
       id = "spdx"
       url = "https://raw.githubusercontent.com/spdx/license-list-data/main/json/licenses.json"
       local_path = "data/spdx/spdx.json"

       def parse(self) -> list[tuple[str, dict[str, Any]]]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           results: list[tuple[str, dict[str, Any]]] = []
           for entry in data.get("licenses", []):
               if not isinstance(entry, dict):
                   continue
               lid = entry.get("licenseId", "")
               urls = entry.get("seeAlso", [])
               url = urls[0] if urls else ""
               results.append((lid, {"url": url, "name": entry.get("name", "")}))
           return results

       def load_registry(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           result: dict[str, str] = {}
           for entry in data.get("licenses", []):
               if not isinstance(entry, dict):
                   continue
               lid = entry.get("licenseId", "")
               if lid:
                   result[lid.lower().strip()] = lid.lower().strip()
           return result

       def load_urls(self) -> dict[str, str]:
           path = Path(__file__).parent.parent / self.local_path
           data = json.loads(path.read_text(encoding="utf-8"))
           result: dict[str, str] = {}
           for entry in data.get("licenses", []):
               if not isinstance(entry, dict):
                   continue
               lid = entry.get("licenseId", "")
               if not lid:
                   continue
               canonical = lid.lower().strip()
               for raw_url in entry.get("seeAlso", []):
                   if not raw_url:
                       continue
                   clean = raw_url.strip().lower().rstrip("/")
                   if clean.startswith("http://"):
                       clean = "https://" + clean[7:]
                   result[clean] = canonical
           return result
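
Each ``load_urls`` implementation repeats the same URL-cleaning steps: strip whitespace, lowercase, drop trailing slashes, and upgrade ``http://`` to ``https://``. Those steps could be factored into a single helper, as sketched here; the name ``clean_url`` is hypothetical and does not exist in the package.

```python
def clean_url(raw_url: str) -> str:
    """Apply the URL-cleaning steps used by the ``load_urls`` methods."""
    clean = raw_url.strip().lower().rstrip("/")
    if clean.startswith("http://"):
        # len("http://") == 7, so slice off the scheme and re-prefix it.
        clean = "https://" + clean[7:]
    return clean
```

Note that lowercasing applies to the whole URL, path included, so lookups have to clean the input the same way before consulting the mapping.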


src/licence_normaliser/plugins.py
=================================

src/licence_normaliser/plugins.py

   """Simple plugin interface definitions.

   Each plugin interface defines a ``load_*`` method that returns a dict or a
   list of tuples.
   Plugins are passed as CLASSES (not instances) - they're instantiated lazily.
   """

   from __future__ import annotations

   import json
   import logging
   import re
   import urllib.error
   import urllib.request
   from pathlib import Path

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"

   __all__ = (
       "AliasPlugin",
       "BasePlugin",
       "FamilyPlugin",
       "NamePlugin",
       "ProsePlugin",
       "RegistryPlugin",
       "URLPlugin",
   )


   class BasePlugin:
       """Base class for all plugins with refresh capability."""

       url: str | None = None
       local_path: str = ""

       @classmethod
       def refresh(cls, force: bool = False) -> bool:
           """Fetch fresh data from ``cls.url`` and write to ``cls.local_path``.

           The local path is resolved relative to the package root
           (``src/licence_normaliser/``).

           If ``cls.url`` is None, this is a local-only parser with no external
           source and the operation succeeds without fetching.

           Returns True on success, False on failure.
           """
           if not cls.local_path:
               return False
           target = Path(__file__).parent / cls.local_path
           if target.exists() and not force:
               return True
           if cls.url is None:
               return True
           try:
               with urllib.request.urlopen(cls.url, timeout=30) as response:  # noqa: S310
                   raw_bytes = response.read()
               json.loads(raw_bytes.decode("utf-8"))
               target.parent.mkdir(parents=True, exist_ok=True)
               target.write_bytes(raw_bytes)
               return True
           except urllib.error.HTTPError as exc:
               # HTTPError subclasses URLError, so it must be caught first.
               logging.warning(
                   "refresh(%s): HTTPError %s fetching %s", cls.__name__, exc.code, cls.url
               )
               return False
           except urllib.error.URLError as exc:
               logging.warning(
                   "refresh(%s): URLError fetching %s - %s", cls.__name__, cls.url, exc
               )
               return False
           except json.JSONDecodeError as exc:
               logging.error(
                   "refresh(%s): invalid JSON from %s - %s", cls.__name__, cls.url, exc
               )
               return False
           except OSError as exc:
               logging.error(
                   "refresh(%s): OSError writing %s - %s", cls.__name__, target, exc
               )
               return False


   class RegistryPlugin:
       """Returns key -> canonical_key mappings."""

       def load_registry(self) -> dict[str, str]:
           raise NotImplementedError


   class URLPlugin:
       """Returns cleaned_url -> version_key mappings."""

       def load_urls(self) -> dict[str, str]:
           raise NotImplementedError


   class AliasPlugin:
       """Returns alias_string -> version_key mappings."""

       def load_aliases(self) -> dict[str, str]:
           raise NotImplementedError


   class FamilyPlugin:
       """Returns version_key -> family_key mappings."""

       def load_families(self) -> dict[str, str]:
           raise NotImplementedError


   class NamePlugin:
       """Returns version_key -> name_key mappings."""

       def load_names(self) -> dict[str, str]:
           raise NotImplementedError


   class ProsePlugin:
       """Returns list of (compiled_pattern, version_key) for prose matching."""

       def load_prose(self) -> list[tuple[re.Pattern[str], str]]:
           raise NotImplementedError
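
When a ``refresh``-style method catches several ``urllib`` exceptions, handler order matters: in the standard library, ``HTTPError`` is a subclass of ``URLError``, which is in turn a subclass of ``OSError``. The sketch below shows the most specific handler placed first; the ``classify`` helper is hypothetical and exists only to make the dispatch visible.

```python
import urllib.error


def classify(exc: Exception) -> str:
    """Report which handler catches ``exc`` when handlers go specific-first."""
    try:
        raise exc
    except urllib.error.HTTPError as err:
        # Most specific: carries an HTTP status code.
        return f"http {err.code}"
    except urllib.error.URLError as err:
        # Network-level failure without an HTTP response.
        return f"url {err.reason}"
    except OSError:
        # Everything else in the OSError family, e.g. a failed write.
        return "os"
```

If the ``URLError`` handler came first, it would also catch every ``HTTPError`` and the status code would never be reported.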


src/licence_normaliser/tests/__init__.py
========================================

src/licence_normaliser/tests/__init__.py

   """Tests for licence_normaliser."""

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"


src/licence_normaliser/tests/conftest.py
========================================

src/licence_normaliser/tests/conftest.py

   """Shared fixtures for licence_normaliser tests."""

   import pytest

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"


   @pytest.fixture()
   def mit_raw() -> str:
       return "MIT"


   @pytest.fixture()
   def cc_by_nc_nd_4_raw() -> str:
       return "CC BY-NC-ND 4.0"


   @pytest.fixture()
   def batch_raw() -> list[str]:
       return ["MIT", "Apache-2.0", "CC BY 4.0"]


src/licence_normaliser/tests/test_aliases.py
============================================

src/licence_normaliser/tests/test_aliases.py

   """Tests for AliasParser - non-CC aliases (Apache, MIT, BSD, GPL, etc.)."""

   from licence_normaliser import normalise_license


   class TestNonCCAliases:
       def test_apache_shorthand(self):
           v = normalise_license("apache")
           assert v.key == "apache-2.0"
           assert v.family.key == "osi"

       def test_apache_license(self):
           v = normalise_license("apache license")
           assert v.key == "apache-2.0"
           assert v.family.key == "osi"

       def test_apache_2(self):
           v = normalise_license("apache 2")
           assert v.key == "apache-2.0"
           assert v.family.key == "osi"

       def test_apache_2_0(self):
           v = normalise_license("apache 2.0")
           assert v.key == "apache-2.0"
           assert v.family.key == "osi"

       def test_mit_license(self):
           v = normalise_license("mit license")
           assert v.key == "mit"
           assert v.family.key == "osi"

       def test_the_mit_license(self):
           v = normalise_license("the mit license")
           assert v.key == "mit"
           assert v.family.key == "osi"

       def test_bsd_shorthand(self):
           v = normalise_license("bsd")
           assert v.key == "bsd-3-clause"
           assert v.family.key == "osi"

       def test_bsd_license(self):
           v = normalise_license("bsd license")
           assert v.key == "bsd-3-clause"
           assert v.family.key == "osi"

       def test_mozilla(self):
           v = normalise_license("mozilla")
           assert v.key == "mpl-2.0"
           assert v.family.key == "osi"

       def test_isc_license(self):
           v = normalise_license("isc license")
           assert v.key == "isc"
           assert v.family.key == "osi"

       def test_gpl_shorthand(self):
           v = normalise_license("gpl")
           assert v.key == "gpl-3.0"
           assert v.family.key == "copyleft"

       def test_gnu_gpl(self):
           v = normalise_license("gnu gpl")
           assert v.key == "gpl-3.0"
           assert v.family.key == "copyleft"

       def test_gnu_gpl_v2(self):
           v = normalise_license("gnu gpl v2")
           assert v.key == "gpl-2.0"
           assert v.family.key == "copyleft"

       def test_gpl_3_0_or_later(self):
           v = normalise_license("gpl-3.0+")
           assert v.key == "gpl-3.0"
           assert v.family.key == "copyleft"

       def test_gpl_2_0_or_later(self):
           v = normalise_license("gpl-2.0+")
           assert v.key == "gpl-2.0"
           assert v.family.key == "copyleft"

       def test_agpl_shorthand(self):
           v = normalise_license("agpl")
           assert v.key == "agpl-3.0"
           assert v.family.key == "copyleft"

       def test_agpl_3_0_or_later(self):
           v = normalise_license("agpl-3.0+")
           assert v.key == "agpl-3.0"
           assert v.family.key == "copyleft"

       def test_lgpl_shorthand(self):
           v = normalise_license("lgpl")
           assert v.key == "lgpl-3.0"
           assert v.family.key == "copyleft"

       def test_lgpl_2_1_or_later(self):
           v = normalise_license("lgpl-2.1+")
           assert v.key == "lgpl-2.1"
           assert v.family.key == "copyleft"

       def test_lgpl_3_0_or_later(self):
           v = normalise_license("lgpl-3.0+")
           assert v.key == "lgpl-3.0"
           assert v.family.key == "copyleft"

       def test_unlicense(self):
           v = normalise_license("unlicense")
           assert v.key == "unlicense"
           assert v.family.key == "osi"

       def test_wtfpl(self):
           v = normalise_license("wtfpl")
           assert v.key == "wtfpl"
           assert v.family.key == "osi"

       def test_zlib(self):
           v = normalise_license("zlib")
           assert v.key == "zlib"
           assert v.family.key == "osi"

       def test_open_database_license(self):
           v = normalise_license("open database license")
           assert v.key == "odbl"
           assert v.family.key == "open-data"

       def test_public_domain(self):
           v = normalise_license("public domain")
           assert v.key == "public-domain"
           assert v.family.key == "public-domain"

       def test_pd_alias(self):
           v = normalise_license("pd")
           assert v.key == "public-domain"
           assert v.family.key == "public-domain"


src/licence_normaliser/tests/test_cache.py
==========================================

src/licence_normaliser/tests/test_cache.py

   """Tests for _cache.py - thread-safe default normaliser singleton."""

   from __future__ import annotations

   import threading
   from concurrent.futures import ThreadPoolExecutor

   from licence_normaliser._cache import (
       _DefaultNormaliser,
       get_registry_keys,
       normalise_license,
       normalise_licenses,
   )
   from licence_normaliser._normaliser import LicenseNormaliser


   class TestDefaultNormaliserSingleton:
       def test_singleton_instance_reused(self) -> None:
           d1 = _DefaultNormaliser()
           d2 = _DefaultNormaliser()
           assert d1.get() is d2.get()

       def test_get_returns_licence_normaliser(self) -> None:
           d = _DefaultNormaliser()
           instance = d.get()
           assert isinstance(instance, LicenseNormaliser)

       def test_thread_safety_same_instance(self) -> None:
           results: list[object | None] = [None] * 20
           errors: list[BaseException | None] = [None] * 20

           def get_instance(idx: int) -> None:
               try:
                   d = _DefaultNormaliser()
                   results[idx] = d.get()
               except BaseException as e:  # noqa: BLE001
                   errors[idx] = e

           threads = [threading.Thread(target=get_instance, args=(i,)) for i in range(20)]
           for t in threads:
               t.start()
           for t in threads:
               t.join()

           assert all(e is None for e in errors)
           assert results[0] is not None
           assert all(r is results[0] for r in results if r is not None)

       def test_concurrent_normalise_license(self) -> None:
           licenses = ["MIT", "Apache-2.0", "CC BY 4.0", "GPL-3.0", "BSD-3-Clause"]

           def normalise(lic: str) -> str:
               v = normalise_license(lic)
               return v.key

           with ThreadPoolExecutor(max_workers=10) as executor:
               futures = [executor.submit(normalise, lic) for lic in licenses * 4]
               results = [f.result(timeout=5) for f in futures]

           assert len(results) == len(licenses) * 4
           assert set(results) == {
               "mit",
               "apache-2.0",
               "cc-by-4.0",
               "gpl-3.0",
               "bsd-3-clause",
           }


   class TestModuleLevelAPI:
       def test_normalise_license_returns_license_version(self) -> None:
           v = normalise_license("MIT")
           assert str(v) == "mit"

       def test_normalise_licenses_returns_list(self) -> None:
           results = normalise_licenses(["MIT", "Apache-2.0"])
           assert len(results) == 2
           assert all(str(r) in ("mit", "apache-2.0") for r in results)

       def test_get_registry_keys_returns_set_of_strings(self) -> None:
           keys = get_registry_keys()
           assert isinstance(keys, set)
           assert len(keys) > 1000
           assert "mit" in keys
           assert "apache-2.0" in keys


src/licence_normaliser/tests/test_cli.py
========================================

src/licence_normaliser/tests/test_cli.py

   """Tests for licence_normaliser CLI - includes new --strict flag."""

   from unittest.mock import patch

   import pytest

   from licence_normaliser.cli._main import main

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"


   class TestNormaliseCommand:
       def test_normalise_mit(self, capsys):
           with patch("sys.argv", ["licence-normaliser", "normalise", "MIT"]):
               with pytest.raises(SystemExit) as exc_info:
                   main()
               assert exc_info.value.code == 0
           assert capsys.readouterr().out.strip() == "mit"

       def test_normalise_full(self, capsys):
           with patch(
               "sys.argv", ["licence-normaliser", "normalise", "--full", "CC BY 4.0"]
           ):
               with pytest.raises(SystemExit) as exc_info:
                   main()
               assert exc_info.value.code == 0
           out = capsys.readouterr().out
           assert "Key: cc-by-4.0" in out
           assert "License: cc-by" in out
           assert "Family: cc" in out

       def test_normalise_cc_url(self, capsys):
           with patch(
               "sys.argv",
               [
                   "licence-normaliser",
                   "normalise",
                   "http://creativecommons.org/licenses/by/4.0/",
               ],
           ):
               with pytest.raises(SystemExit) as exc_info:
                   main()
               assert exc_info.value.code == 0
           assert capsys.readouterr().out.strip() == "cc-by-4.0"

       def test_normalise_unknown(self, capsys):
           with patch(
               "sys.argv", ["licence-normaliser", "normalise", "totally-unknown-xyz"]
           ):
               with pytest.raises(SystemExit) as exc_info:
                   main()
               assert exc_info.value.code == 0
           assert "totally-unknown-xyz" in capsys.readouterr().out

       def test_normalise_strict_known(self, capsys):
           with patch("sys.argv", ["licence-normaliser", "normalise", "--strict", "MIT"]):
               with pytest.raises(SystemExit) as exc_info:
                   main()
               assert exc_info.value.code == 0
           assert capsys.readouterr().out.strip() == "mit"

       def test_normalise_strict_unknown_exits_1(self, capsys):
           with patch(
               "sys.argv",
               ["licence-normaliser", "normalise", "--strict", "totally-unknown-xyz-9999"],
           ):
               with pytest.raises(SystemExit) as exc_info:
                   main()
               assert exc_info.value.code == 1
           assert capsys.readouterr().err  # error message on stderr


   class TestBatchCommand:
       def test_batch_basic(self, capsys):
           with patch(
               "sys.argv",
               ["licence-normaliser", "batch", "MIT", "Apache-2.0", "CC BY 4.0"],
           ):
               with pytest.raises(SystemExit) as exc_info:
                   main()
               assert exc_info.value.code == 0
           out = capsys.readouterr().out
           assert "MIT: mit" in out
           assert "Apache-2.0: apache-2.0" in out
           assert "CC BY 4.0: cc-by-4.0" in out

       def test_batch_strict_all_known(self, capsys):
           with patch(
               "sys.argv", ["licence-normaliser", "batch", "--strict", "MIT", "GPL-3.0"]
           ):
               with pytest.raises(SystemExit) as exc_info:
                   main()
               assert exc_info.value.code == 0

       def test_batch_strict_with_unknown_exits_1(self, capsys):
           with patch(
               "sys.argv",
               ["licence-normaliser", "batch", "--strict", "MIT", "no-such-license-xyz"],
           ):
               with pytest.raises(SystemExit) as exc_info:
                   main()
               assert exc_info.value.code == 1


   class TestVersionFlag:
       def test_version_flag(self, capsys):
           with patch("sys.argv", ["licence-normaliser", "--version"]):
               with pytest.raises(SystemExit) as exc_info:
                   main()
               assert exc_info.value.code == 0
           assert "licence-normaliser" in capsys.readouterr().out


src/licence_normaliser/tests/test_core.py
=========================================

src/licence_normaliser/tests/test_core.py

   """End-to-end pipeline tests via the public API."""

   from licence_normaliser import normalise_license, normalise_licenses

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"


   class TestDirectLookup:
       def test_mit(self):
           v = normalise_license("mit")
           assert v.key == "mit"
           assert v.family.key == "osi"

       def test_apache(self):
           v = normalise_license("apache-2.0")
           assert v.key == "apache-2.0"
           assert v.family.key == "osi"

       def test_cc_by_4_0(self):
           v = normalise_license("cc-by-4.0")
           assert v.key == "cc-by-4.0"
           assert v.family.key == "cc"

       def test_cc_by_nc_nd_4_0(self):
           v = normalise_license("cc-by-nc-nd-4.0")
           assert v.key == "cc-by-nc-nd-4.0"
           assert v.family.key == "cc"

       def test_cc0_1_0(self):
           v = normalise_license("cc0-1.0")
           assert v.key == "cc0-1.0"
           assert v.family.key == "cc0"

       def test_gpl_3_0(self):
           v = normalise_license("gpl-3.0")
           assert v.key == "gpl-3.0"
           assert v.family.key == "copyleft"

       def test_gpl_2_0_only(self):
           v = normalise_license("gpl-2.0-only")
           assert v.key == "gpl-2.0-only"
           assert v.family.key == "copyleft"

       def test_lgpl_2_1(self):
           v = normalise_license("lgpl-2.1")
           assert v.key == "lgpl-2.1"
           assert v.family.key == "copyleft"

       def test_agpl_3_0(self):
           v = normalise_license("agpl-3.0")
           assert v.key == "agpl-3.0"
           assert v.family.key == "copyleft"

       def test_bsd_3_clause(self):
           v = normalise_license("bsd-3-clause")
           assert v.key == "bsd-3-clause"
           assert v.family.key == "osi"

       def test_isc(self):
           v = normalise_license("isc")
           assert v.key == "isc"
           assert v.family.key == "osi"

       def test_mpl_2_0(self):
           v = normalise_license("mpl-2.0")
           assert v.key == "mpl-2.0"
           assert v.family.key == "osi"

       def test_unlicense(self):
           v = normalise_license("unlicense")
           assert v.key == "unlicense"
           assert v.family.key == "osi"

       def test_wtfpl(self):
           v = normalise_license("wtfpl")
           assert v.key == "wtfpl"
           assert v.family.key == "osi"

       def test_zlib(self):
           v = normalise_license("zlib")
           assert v.key == "zlib"
           assert v.family.key == "osi"

       def test_odbl_1_0(self):
           v = normalise_license("odbl-1.0")
           assert v.key == "odbl-1.0"
           assert v.family.key == "open-data"

       def test_pddl_1_0(self):
           v = normalise_license("pddl-1.0")
           assert v.key == "pddl-1.0"
           assert v.family.key == "data"

       def test_odc_by_1_0(self):
           v = normalise_license("odc-by-1.0")
           assert v.key == "odc-by-1.0"
           assert v.family.key == "open-data"

       def test_unknown(self):
           v = normalise_license("unknown")
           assert v.key == "unknown"
           assert v.family.key == "unknown"

       def test_case_insensitive(self):
           v = normalise_license("MIT")
           assert v.key == "mit"
           v = normalise_license("Apache-2.0")
           assert v.key == "apache-2.0"


   class TestBuiltinAliases:
       def test_cc_by(self):
           assert normalise_license("CC BY").key == "cc-by"

       def test_cc_by_4_0(self):
           assert normalise_license("CC BY 4.0").key == "cc-by-4.0"

       def test_cc_by_nc_nd_4_0(self):
           assert normalise_license("CC BY-NC-ND 4.0").key == "cc-by-nc-nd-4.0"

       def test_cc_by_nc_sa_4_0(self):
           assert normalise_license("CC BY-NC-SA 4.0").key == "cc-by-nc-sa-4.0"

       def test_cc0_1_0(self):
           assert normalise_license("CC0 1.0").key == "cc0-1.0"

       def test_public_domain(self):
           assert normalise_license("public domain").key == "public-domain"


   class TestUrlLookup:
       def test_cc_by_https(self):
           v = normalise_license("https://creativecommons.org/licenses/by/4.0/")
           assert v.key == "cc-by-4.0"

       def test_cc_by_http(self):
           v = normalise_license("http://creativecommons.org/licenses/by/4.0/")
           assert v.key == "cc-by-4.0"

       def test_cc_by_no_trailing_slash(self):
           v = normalise_license("https://creativecommons.org/licenses/by/4.0")
           assert v.key == "cc-by-4.0"

       def test_mit_url(self):
           v = normalise_license("https://opensource.org/licenses/MIT")
           assert v.key == "mit"


   class TestFamilyInference:
       def test_cc_family(self):
           v = normalise_license("cc-by-4.0")
           assert v.family.key == "cc"

       def test_cc0_family(self):
           v = normalise_license("cc0-1.0")
           assert v.family.key == "cc0"

       def test_copyleft_family(self):
           assert normalise_license("gpl-3.0").family.key == "copyleft"
           assert normalise_license("agpl-3.0").family.key == "copyleft"
           assert normalise_license("lgpl-2.1").family.key == "copyleft"

       def test_osi_family(self):
           assert normalise_license("mit").family.key == "osi"
           assert normalise_license("apache-2.0").family.key == "osi"
           assert normalise_license("bsd-3-clause").family.key == "osi"

       def test_data_family(self):
           assert normalise_license("pddl-1.0").family.key == "data"


   class TestNameInference:
       def test_cc_name_strips_version(self):
           assert normalise_license("cc-by-4.0").license.key == "cc-by"
           assert normalise_license("cc-by-nc-nd-4.0").license.key == "cc-by-nc-nd"
           assert normalise_license("cc-by-sa-3.0").license.key == "cc-by-sa"
           assert normalise_license("cc0-1.0").license.key == "cc0"
           assert normalise_license("cc-by-nc-sa-4.0").license.key == "cc-by-nc-sa"

       def test_non_cc_keeps_key(self):
           assert normalise_license("mit").license.key == "mit"
           # gpl-3.0 is reduced to its major version rather than kept verbatim
           assert normalise_license("gpl-3.0").license.key == "gpl-3"


   class TestHierarchyNavigation:
       def test_version_license_family_chain(self):
           v = normalise_license("CC BY-NC-ND 4.0")
           assert v.key == "cc-by-nc-nd-4.0"
           assert v.license.key == "cc-by-nc-nd"
           assert v.license.family.key == "cc"
           assert v.family.key == "cc"

       def test_str_representations(self):
           v = normalise_license("CC BY-NC-ND 4.0")
           assert str(v) == "cc-by-nc-nd-4.0"
           assert str(v.license) == "cc-by-nc-nd"
           assert str(v.family) == "cc"


   class TestFallback:
       def test_unknown_string(self):
           v = normalise_license("some-totally-unknown-license-xyz")
           assert v.key == "some-totally-unknown-license-xyz"
           assert v.family.key == "unknown"

       def test_empty_string(self):
           v = normalise_license("")
           assert v.key == "unknown"

       def test_whitespace_only(self):
           v = normalise_license("   ")
           assert v.key == "unknown"


   class TestBatchNormalisation:
       def test_basic_batch(self):
           results = normalise_licenses(["MIT", "Apache-2.0", "CC BY 4.0"])
           assert [r.key for r in results] == ["mit", "apache-2.0", "cc-by-4.0"]

       def test_batch_preserves_order(self):
           raw = ["GPL-3.0", "MIT", "CC BY 4.0", "Apache-2.0"]
           expected = ["gpl-3.0", "mit", "cc-by-4.0", "apache-2.0"]
           assert [r.key for r in normalise_licenses(raw)] == expected

       def test_batch_accepts_generator(self):
           results = normalise_licenses(x for x in ["MIT", "ISC"])
           assert results[0].key == "mit"

       def test_batch_empty(self):
           assert normalise_licenses([]) == []
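
The ``TestUrlLookup`` cases above expect the ``http`` and ``https`` forms of a URL, with or without a trailing slash, to resolve to the same key. The following is a minimal sketch of one way such a lookup could be canonicalised. It is not the project's implementation: ``URL_MAP``, ``canonical_url`` and ``lookup_url`` are hypothetical names, and the table contents are illustrative only.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical lookup table for this sketch; the format of the project's
# own url_map.json is not shown in this section.
URL_MAP = {
    "creativecommons.org/licenses/by/4.0": "cc-by-4.0",
    "opensource.org/licenses/mit": "mit",
}


def canonical_url(raw: str) -> str:
    """Lower-case the URL, drop the scheme, and strip trailing slashes."""
    parts = urlsplit(raw.strip().lower())
    return f"{parts.netloc}{parts.path}".rstrip("/")


def lookup_url(raw: str) -> Optional[str]:
    """Return the mapped key, or None when the URL is not in the table."""
    return URL_MAP.get(canonical_url(raw))
```

Because the scheme and the trailing slash are removed before the dictionary lookup, one table entry covers all the URL variants exercised by the tests.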


src/licence_normaliser/tests/test_exceptions.py
===============================================

src/licence_normaliser/tests/test_exceptions.py

   """Tests for strict mode and the public exception hierarchy."""

   import pytest

   from licence_normaliser import normalise_license, normalise_licenses
   from licence_normaliser.exceptions import (
       LicenseNormaliserError,
       LicenseNotFoundError,
   )

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"


   class TestLicenseNotFoundError:
       def test_is_subclass_of_base(self):
           assert issubclass(LicenseNotFoundError, LicenseNormaliserError)

       def test_is_subclass_of_exception(self):
           assert issubclass(LicenseNotFoundError, Exception)

       def test_attributes(self):
           exc = LicenseNotFoundError("My License", "my license")
           assert exc.raw == "My License"
           assert exc.cleaned == "my license"

       def test_str_contains_raw(self):
           exc = LicenseNotFoundError("My License", "my license")
           assert "My License" in str(exc)

       def test_str_mentions_strict_false(self):
           exc = LicenseNotFoundError("x", "x")
           assert "strict=False" in str(exc)


   class TestStrictModeNormalise:
       def test_known_license_no_raise(self):
           # Known licenses must not raise in strict mode
           v = normalise_license("MIT", strict=True)
           assert v.key == "mit"

       def test_unknown_raises_license_not_found(self):
           with pytest.raises(LicenseNotFoundError) as exc_info:
               normalise_license("totally-unknown-xyz-9999", strict=True)
           assert exc_info.value.raw == "totally-unknown-xyz-9999"
           assert exc_info.value.cleaned == "totally-unknown-xyz-9999"

       def test_empty_string_raises(self):
           with pytest.raises(LicenseNotFoundError):
               normalise_license("", strict=True)

       def test_whitespace_only_raises(self):
           with pytest.raises(LicenseNotFoundError):
               normalise_license("   ", strict=True)

       def test_cc_url_known_no_raise(self):
           v = normalise_license(
               "https://creativecommons.org/licenses/by/4.0/", strict=True
           )
           assert v.key == "cc-by-4.0"

       def test_strict_false_unknown_returns_unknown(self):
           # Non-strict mode: no exception is raised; the result's family is "unknown"
           v = normalise_license("no-such-license-xyzzy", strict=False)
           assert v.family.key == "unknown"

       def test_strict_default_is_false(self):
           # Calling without strict kwarg should not raise
           v = normalise_license("no-such-license-xyzzy")
           assert v.family.key == "unknown"


   class TestStrictModeBatch:
       def test_all_known_no_raise(self):
           results = normalise_licenses(["MIT", "Apache-2.0"], strict=True)
           assert len(results) == 2
           assert results[0].key == "mit"
           assert results[1].key == "apache-2.0"

       def test_one_unknown_raises(self):
           with pytest.raises(LicenseNotFoundError):
               normalise_licenses(["MIT", "no-such-license-xyz"], strict=True)

       def test_non_strict_batch_with_unknown(self):
           results = normalise_licenses(["MIT", "no-such-license-xyz"], strict=False)
           assert results[0].key == "mit"
           assert results[1].family.key == "unknown"

       def test_empty_batch_strict(self):
           # Empty input should not raise even in strict mode
           assert normalise_licenses([], strict=True) == []
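
The behaviour these tests pin down can be illustrated with a small, self-contained sketch: an exception that carries the ``raw`` and ``cleaned`` strings, a message that points at ``strict=False``, and a fallback result in non-strict mode. Everything here (``NormaliserError``, ``NotFoundError``, ``KNOWN``, ``normalise``, ``normalise_many``) is a hypothetical stand-in and only assumes the behaviour visible in the tests, not the library's actual code.

```python
from typing import Iterable, List


class NormaliserError(Exception):
    """Hypothetical base class for this sketch."""


class NotFoundError(NormaliserError):
    """Raised in strict mode when the input cannot be resolved."""

    def __init__(self, raw: str, cleaned: str) -> None:
        self.raw = raw
        self.cleaned = cleaned
        super().__init__(
            f"License not found: {raw!r} (cleaned: {cleaned!r}). "
            "Pass strict=False to get a fallback result instead."
        )


KNOWN = {"mit", "apache-2.0"}  # stand-in for a real registry


def normalise(raw: str, strict: bool = False) -> str:
    """Return a key; raise in strict mode, fall back otherwise."""
    cleaned = raw.strip().lower()
    if cleaned in KNOWN:
        return cleaned
    if strict:
        raise NotFoundError(raw, cleaned)
    # Non-strict fallback: keep the cleaned text, or "unknown" if empty.
    return cleaned or "unknown"


def normalise_many(raws: Iterable[str], strict: bool = False) -> List[str]:
    """Normalise each item in order; the first strict failure propagates."""
    return [normalise(raw, strict=strict) for raw in raws]
```

An empty batch never enters the loop, so it returns an empty list even in strict mode, which matches ``test_empty_batch_strict``.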


src/licence_normaliser/tests/test_integration.py
================================================

src/licence_normaliser/tests/test_integration.py

   """Comprehensive integration tests covering the full license matrix.

   Each tuple: (input_string, expected_version_key, expected_license_key,
                expected_family_key)
   """

   import pytest

   from licence_normaliser import (
       LicenseNormalisationError,
       LicenseNotFoundError,
       LicenseVersion,
       normalise_license,
       normalise_licenses,
   )

   LICENSE_MATRIX = [
       # raw,expected_key,expected_license,expected_family
       # === OSI-approved licenses ===
       ("mit", "mit", "mit", "osi"),
       ("MIT", "mit", "mit", "osi"),
       ("  mit  ", "mit", "mit", "osi"),
       ("apache-2.0", "apache-2.0", "apache", "osi"),
       ("Apache-2.0", "apache-2.0", "apache", "osi"),
       ("Apache 2.0", "apache-2.0", "apache", "osi"),
       ("Apache License 2.0", "apache-2.0", "apache", "osi"),
       (
           "BSD 3-Clause",
           "bsd-3-clause",
           "bsd-3-clause",
           "osi",
       ),  # Resolves to bsd-3-clause/osi, matches SPDX and alias entries
       ("bsd-3-clause", "bsd-3-clause", "bsd-3-clause", "osi"),
       ("BSD License", "bsd-3-clause", "bsd-3-clause", "osi"),
       ("MPL-2.0", "mpl-2.0", "mpl", "osi"),
       ("mpl-2.0", "mpl-2.0", "mpl", "osi"),
       (
           "Mozilla Public License 2.0",
           "mpl-2.0",
           "mpl",
           "osi",
       ),  # Canonical full name of MPL-2.0, matches alias entry
       ("ISC", "isc", "isc", "osi"),
       ("isc", "isc", "isc", "osi"),
       ("ISC License", "isc", "isc", "osi"),
       ("Unlicense", "unlicense", "unlicense", "osi"),
       ("unlicense", "unlicense", "unlicense", "osi"),
       ("WTFPL", "wtfpl", "wtfpl", "osi"),
       ("wtfpl", "wtfpl", "wtfpl", "osi"),
       ("Zlib", "zlib", "zlib", "osi"),
       ("zlib", "zlib", "zlib", "osi"),
       # === GPL / AGPL / LGPL (copyleft) ===
       ("gpl-3.0", "gpl-3.0", "gpl-3", "copyleft"),
       ("GPL-3.0", "gpl-3.0", "gpl-3", "copyleft"),
       ("gpl-3.0+", "gpl-3.0", "gpl-3", "copyleft"),
       (
           "gpl-3-0",
           "gpl-3-0",
           "gpl-3-0",
           "copyleft",
       ),  # NOTE: hyphen instead of dot; resolver recognises gpl but doesn't normalise
       ("GNU GPL v3", "gpl-3.0", "gpl-3", "copyleft"),
       ("GPL v3", "gpl-3.0", "gpl-3", "copyleft"),
       ("gpl-2.0", "gpl-2.0", "gpl-2", "copyleft"),
       ("GPL v2", "gpl-2.0", "gpl-2", "copyleft"),
       ("lgpl-3.0", "lgpl-3.0", "lgpl-3", "copyleft"),
       ("LGPL-3.0", "lgpl-3.0", "lgpl-3", "copyleft"),
       ("lgpl-2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"),
       ("LGPL v2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"),
       ("lgpl v2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"),
       ("agpl-3.0", "agpl-3.0", "agpl-3", "copyleft"),
       ("AGPL v3", "agpl-3.0", "agpl-3", "copyleft"),
       # === Creative Commons ===
       ("CC BY 4.0", "cc-by-4.0", "cc-by", "cc"),
       ("cc by 4.0", "cc-by-4.0", "cc-by", "cc"),
       ("cc-by-4.0", "cc-by-4.0", "cc-by", "cc"),
       ("CC BY 3.0", "cc-by-3.0", "cc-by", "cc"),
       ("cc by 3.0", "cc-by-3.0", "cc-by", "cc"),
       ("cc-by-3.0", "cc-by-3.0", "cc-by", "cc"),
       ("CC BY 2.5", "cc-by-2.5", "cc-by", "cc"),
       ("CC BY 2.0", "cc-by-2.0", "cc-by", "cc"),
       ("CC BY 1.0", "cc-by-1.0", "cc-by", "cc"),
       ("cc by", "cc-by", "cc-by", "cc"),
       (
           "CC-BY",
           "cc-by",
           "cc-by",
           "cc",
       ),  # SPDX form, resolves to cc-by/cc
       ("CC BY-NC 4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"),
       ("cc by-nc 4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"),
       ("cc-by-nc-4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"),
       ("CC BY-NC 3.0", "cc-by-nc-3.0", "cc-by-nc", "cc"),
       ("CC BY-NC-SA 4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"),
       ("cc by-nc-sa 4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"),
       ("cc-by-nc-sa-4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"),
       ("CC BY-NC-SA 3.0", "cc-by-nc-sa-3.0", "cc-by-nc-sa", "cc"),
       ("CC BY-NC-ND 4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
       ("cc by-nc-nd 4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
       ("cc-by-nc-nd-4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
       ("CC BY-NC-ND 3.0", "cc-by-nc-nd-3.0", "cc-by-nc-nd", "cc"),
       ("cc by-nc-nd 3.0", "cc-by-nc-nd-3.0", "cc-by-nc-nd", "cc"),
       ("CC BY-ND 4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"),
       ("cc by-nd 4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"),
       ("cc-by-nd-4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"),
       ("CC BY-SA 4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"),
       ("cc by-sa 4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"),
       ("cc-by-sa-4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"),
       ("CC BY-SA 3.0", "cc-by-sa-3.0", "cc-by-sa", "cc"),
       ("cc-by-3.0-igo", "cc-by-3.0-igo", "cc-by", "cc"),
       ("cc-by-nc-nd-3.0-igo", "cc-by-nc-nd-3.0-igo", "cc-by-nc-nd", "cc"),
       # CC0
       ("CC0 1.0", "cc0-1.0", "cc0", "cc0"),
       ("cc0 1.0", "cc0-1.0", "cc0", "cc0"),
       ("cc0-1.0", "cc0-1.0", "cc0", "cc0"),
       ("CC0", "cc0-1.0", "cc0", "cc0"),
       ("cc0", "cc0-1.0", "cc0", "cc0"),
       ("cc-zero", "cc0-1.0", "cc0", "cc0"),
       ("CC Zero", "cc0-1.0", "cc0", "cc0"),
       ("CC-Zero", "cc0-1.0", "cc0", "cc0"),
       ("creative commons zero", "cc0-1.0", "cc0", "cc0"),
       ("Creative Commons Zero 1.0", "cc0-1.0", "cc0", "cc0"),
       # CC-PDM
       ("cc-pdm", "cc-pdm-1.0", "cc-pdm", "public-domain"),
       ("CC-PDM", "cc-pdm-1.0", "cc-pdm", "public-domain"),
       ("cc-pdm-1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"),
       ("CC-PDM 1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"),
       ("cc-pdm 1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"),
       ("creative commons public domain", "cc-pdm-1.0", "cc-pdm", "public-domain"),
       # CC shorthand
       ("creative commons by", "cc-by", "cc-by", "cc"),
       ("creative commons by 4.0", "cc-by-4.0", "cc-by", "cc"),
       (
           "creative commons by-sa",
           "cc-by-sa",
           "cc-by-sa",
           "cc",
       ),  # Specifies by-sa, license must be cc-by-sa
       (
           "creative commons by-nc",
           "cc-by-nc",
           "cc-by-nc",
           "cc",
       ),  # Specifies by-nc, license must be cc-by-nc
       (
           "creative commons by-nc-sa",
           "cc-by-nc-sa",
           "cc-by-nc-sa",
           "cc",
       ),  # Specifies by-nc-sa, license must be cc-by-nc-sa
       (
           "creative commons by-nc-nd",
           "cc-by-nc-nd",
           "cc-by-nc-nd",
           "cc",
       ),  # Specifies by-nc-nd, license must be cc-by-nc-nd
       (
           "creative commons by-nd",
           "cc-by-nd",
           "cc-by-nd",
           "cc",
       ),  # Specifies by-nd, license must be cc-by-nd
       # CC URLs
       (
           "http://creativecommons.org/licenses/by-nc-nd/4.0/",
           "cc-by-nc-nd-4.0",
           "cc-by-nc-nd",
           "cc",
       ),
       ("https://creativecommons.org/licenses/by/4.0/", "cc-by-4.0", "cc-by", "cc"),
       ("http://creativecommons.org/licenses/by/4.0/", "cc-by-4.0", "cc-by", "cc"),
       (
           "https://creativecommons.org/licenses/by-nc/4.0/",
           "cc-by-nc-4.0",
           "cc-by-nc",
           "cc",
       ),
       (
           "https://creativecommons.org/licenses/by-nc-sa/4.0/",
           "cc-by-nc-sa-4.0",
           "cc-by-nc-sa",
           "cc",
       ),
       (
           "https://creativecommons.org/licenses/by-nd/4.0/",
           "cc-by-nd-4.0",
           "cc-by-nd",
           "cc",
       ),
       (
           "https://creativecommons.org/licenses/by-sa/4.0/",
           "cc-by-sa-4.0",
           "cc-by-sa",
           "cc",
       ),
       (
           "http://creativecommons.org/licenses/by-nc-nd/3.0/igo/",
           "cc-by-nc-nd-3.0-igo",
           "cc-by-nc-nd",
           "cc",
       ),
       (
           "https://creativecommons.org/licenses/by/3.0/igo/",
           "cc-by-3.0-igo",
           "cc-by",
           "cc",
       ),
       ("https://creativecommons.org/publicdomain/zero/1.0/", "cc0-1.0", "cc0", "cc0"),
       ("http://creativecommons.org/publicdomain/zero/1.0/", "cc0-1.0", "cc0", "cc0"),
       # CC prose
       ("licensed under cc by-nc-nd 4.0 terms", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
       (
           "content is licensed under creative commons by-nc-sa",
           "cc-by-nc-sa",
           "cc-by-nc-sa",  # Contains by-nc-sa, license must be cc-by-nc-sa
           "cc",
       ),
       ("this content is under creative commons by license", "cc-by", "cc-by", "cc"),
       # Open Data
       ("ODbL", "odbl", "odbl", "open-data"),
       ("odbl", "odbl", "odbl", "open-data"),
       ("Open Database License", "odbl", "odbl", "open-data"),
       ("ODC-BY", "odc-by", "odc-by", "open-data"),
       ("odc-by", "odc-by", "odc-by", "open-data"),
       ("PDDL", "pddl", "pddl", "open-data"),
       ("pddl", "pddl", "pddl", "open-data"),
       (
           "Open Data Commons Public Domain Dedication",
           "public-domain",
           "public-domain",
           "public-domain",
       ),
       # Publisher
       ("elsevier-oa", "elsevier-oa", "elsevier-oa", "publisher-oa"),
       (
           "Elsevier OA",
           "elsevier-oa",
           "elsevier-oa",
           "publisher-oa",
       ),  # "Elsevier OA" unambiguously identifies Elsevier OA license
       ("elsevier tdm", "elsevier-tdm", "elsevier-tdm", "publisher-tdm"),
       ("Elsevier TDM", "elsevier-tdm", "elsevier-tdm", "publisher-tdm"),
       ("Elsevier User License", "elsevier-oa", "elsevier-oa", "publisher-oa"),
       (
           "https://www.elsevier.com/open-access/userlicense/1.0/",
           "elsevier-oa",
           "elsevier-oa",
           "publisher-oa",
       ),
       ("wiley-tdm", "wiley-tdm", "wiley-tdm", "publisher-tdm"),
       ("Wiley TDM", "wiley-tdm", "wiley-tdm", "publisher-tdm"),
       ("wiley vor", "wiley-vor", "wiley-vor", "publisher-proprietary"),
       ("springer-tdm", "springer-tdm", "springer-tdm", "publisher-tdm"),
       (
           "Springer Nature TDM",
           "springernature-tdm",
           "springernature-tdm",
           "publisher-tdm",
       ),
       ("acs-authorchoice", "acs-authorchoice", "acs-authorchoice", "publisher-oa"),
       ("ACS AuthorChoice", "acs-authorchoice", "acs-authorchoice", "publisher-oa"),
       (
           "acs-authorchoice-ccby",
           "acs-authorchoice-ccby",
           "acs-authorchoice-ccby",
           "publisher-oa",
       ),
       (
           "acs authorchoice cc by",
           "acs-authorchoice-ccby",
           "acs-authorchoice-ccby",
           "publisher-oa",
       ),
       ("aps-default", "aps-default", "aps-default", "publisher-proprietary"),
       ("APS Default", "aps-default", "aps-default", "publisher-proprietary"),
       ("iop-tdm", "iop-tdm", "iop-tdm", "publisher-tdm"),
       ("iop copyright", "iop-copyright", "iop-copyright", "publisher-proprietary"),
       ("bmj copyright", "bmj-copyright", "bmj-copyright", "publisher-proprietary"),
       ("rsc terms", "rsc-terms", "rsc-terms", "publisher-proprietary"),
       ("cup terms", "cup-terms", "cup-terms", "publisher-proprietary"),
       ("degruyter terms", "degruyter-terms", "degruyter-terms", "publisher-proprietary"),
       ("tandf terms", "tandf-terms", "tandf-terms", "publisher-proprietary"),
       (
           "sage permissions",
           "sage-permissions",
           "sage-permissions",
           "publisher-proprietary",
       ),
       ("wiley terms", "wiley-terms", "wiley-terms", "publisher-proprietary"),
       ("wiley am", "wiley-am", "wiley-am", "publisher-proprietary"),
       ("pnas licenses", "pnas-licenses", "pnas-licenses", "publisher-proprietary"),
       (
           "aaas author reuse",
           "aaas-author-reuse",
           "aaas-author-reuse",
           "publisher-proprietary",
       ),
       ("aip rights", "aip-rights", "aip-rights", "publisher-proprietary"),
       ("jama cc by", "jama-cc-by", "jama-cc-by", "publisher-oa"),
       ("thieme nlm", "thieme-nlm", "thieme-nlm", "publisher-oa"),
       ("oup chorus", "oup-chorus", "oup-chorus", "publisher-oa"),
       ("implied oa", "implied-oa", "implied-oa", "publisher-oa"),
       ("implied open access", "implied-oa", "implied-oa", "publisher-oa"),
       ("unspecified oa", "unspecified-oa", "unspecified-oa", "other-oa"),
       (
           "publisher specific oa",
           "publisher-specific-oa",
           "publisher-specific-oa",
           "publisher-oa",
       ),
       ("author manuscript", "author-manuscript", "author-manuscript", "publisher-oa"),
       ("open access", "other-oa", "other-oa", "other-oa"),
       ("other-oa", "other-oa", "other-oa", "other-oa"),
       (
           "all rights reserved",
           "all-rights-reserved",
           "all-rights-reserved",
           "publisher-proprietary",
       ),
       ("no reuse", "no-reuse", "no-reuse", "publisher-proprietary"),
       # Publisher prose
       (
           "this article is licensed under elsevier tdm agreement",
           "elsevier-tdm",
           "elsevier-tdm",
           "publisher-tdm",
       ),
       (
           "journal article under elsevier user license for open access",
           "elsevier-oa",
           "elsevier-oa",
           "publisher-oa",
       ),
       (
           "acs authorchoice option was selected by the authors",
           "acs-authorchoice",
           "acs-authorchoice",
           "publisher-oa",
       ),
       (
           "springer tdm policy applies to this content",
           "springer-tdm",
           "springer-tdm",
           "publisher-tdm",
       ),
       # Unknown
       (
           "Totally Fake License XYZ999",
           "totally fake license xyz999",
           "totally fake license xyz999",
           "unknown",
       ),
       # Public domain
       ("public domain", "public-domain", "public-domain", "public-domain"),
       ("public-domain", "public-domain", "public-domain", "public-domain"),
       ("pd", "public-domain", "public-domain", "public-domain"),
   ]


   @pytest.mark.parametrize(
       "raw,expected_key,expected_license,expected_family", LICENSE_MATRIX
   )
   def test_license_matrix(raw, expected_key, expected_license, expected_family):
       v = normalise_license(raw)
       assert v.key == expected_key, f"input: {raw!r}  key: {v.key!r} != {expected_key!r}"
       assert v.license.key == expected_license, (
           f"input: {raw!r}  license: {v.license.key!r} != {expected_license!r}"
       )
       assert v.family.key == expected_family, (
           f"input: {raw!r}  family: {v.family.key!r} != {expected_family!r}"
       )


   def test_strict_mode_unknown_raises():
       with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
           normalise_license("xyzzy unknown license 123", strict=True)


   def test_strict_mode_known_does_not_raise():
       # Strict mode must be enabled for this test to exercise what its name says
       v = normalise_license("mit", strict=True)
       assert v.key == "mit"


   def test_empty_string_returns_unknown():
       v = normalise_license("")
       assert v.key == "unknown"
       assert v.family.key == "unknown"


   def test_whitespace_only_returns_unknown():
       v = normalise_license("   \n\t  ")
       assert v.key == "unknown"


   def test_batch_normalise_preserves_order():
       inputs = ["MIT", "Apache-2.0", "CC BY 4.0", "unknown garbage"]
       results = normalise_licenses(inputs)
       assert [r.key for r in results] == [
           "mit",
           "apache-2.0",
           "cc-by-4.0",
           "unknown garbage",
       ]


   def test_normalise_mit():
       v = normalise_license("MIT")
       assert isinstance(v, LicenseVersion)
       assert v.key == "mit"
       assert str(v) == "mit"
       assert str(v.license) == "mit"


   def test_normalise_cc():
       v = normalise_license("CC BY 4.0")
       assert v.key == "cc-by-4.0"
       assert str(v.license) == "cc-by"
       assert str(v.family) == "cc"


   def test_batch():
       results = normalise_licenses(["MIT", "Apache-2.0"])
       assert len(results) == 2
       assert results[0].key == "mit"
       assert results[1].key == "apache-2.0"


   def test_strict_mode_raises():
       with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
           normalise_license("Totally Fake License XYZ999", strict=True)


   def test_strict_batch_raises():
       with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
           normalise_licenses(["MIT", "Fake License XYZ999"], strict=True)


   def test_empty_input():
       v = normalise_license("")
       assert v.key == "unknown"
       v = normalise_license("   ")
       assert v.key == "unknown"


   def test_real_world_license_strings():
       """Test against real-world license strings collected from the wild."""
       cases = [
           ("http://creativecommons.org/licenses/by-nc-nd/4.0/", "cc-by-nc-nd-4.0"),
           ("http://creativecommons.org/licenses/by/4.0/", "cc-by-4.0"),
           ("http://creativecommons.org/licenses/by-nc/4.0/", "cc-by-nc-4.0"),
           (
               "http://www.elsevier.com/open-access/userlicense/1.0/",
               "elsevier-oa",
           ),
           (
               "http://creativecommons.org/licenses/by-nc-nd/3.0/igo/",
               "cc-by-nc-nd-3.0-igo",
           ),
           ("CC BY-NC-ND 4.0", "cc-by-nc-nd-4.0"),
           (
               "http://creativecommons.org/licenses/by/3.0/igo/",
               "cc-by-3.0-igo",
           ),
       ]
       for raw, expected_key in cases:
           v = normalise_license(raw)
           assert v.key == expected_key, (
               f"input: {raw!r} -> got {v.key!r}, want {expected_key!r}"
           )
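
For the Creative Commons rows of ``LICENSE_MATRIX``, the expected license key is the version key with its version number (and any ``-igo`` style suffix) removed. The sketch below shows one way to derive that with a regular expression. ``cc_license_key`` is a hypothetical helper that covers only the CC rows; non-CC keys are passed through unchanged here, even though the matrix expects other reductions for them (for example ``gpl-3.0`` to ``gpl-3``).

```python
import re

# Matches a trailing "-<version>" with an optional "-<suffix>", e.g.
# "-4.0" or "-3.0-igo". Assumed pattern for this sketch only.
_CC_VERSION_SUFFIX = re.compile(r"-\d+(?:\.\d+)*(?:-[a-z]+)?$")


def cc_license_key(version_key: str) -> str:
    """Strip the version part from a CC version key; leave others as-is."""
    if not version_key.startswith("cc"):
        return version_key
    return _CC_VERSION_SUFFIX.sub("", version_key)
```

Keys without a version, such as ``cc-by``, contain no trailing numeric part and are returned unchanged.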


src/licence_normaliser/tests/test_models.py
===========================================

src/licence_normaliser/tests/test_models.py

   """Unit tests for _models.py."""

   import pytest

   from licence_normaliser._models import LicenseFamily, LicenseName, LicenseVersion

   __author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
   __copyright__ = "2026 Artur Barseghyan"
   __license__ = "MIT"


   def _cc_fam():
       return LicenseFamily(key="cc")


   def _osi_fam():
       return LicenseFamily(key="osi")


   def _cc_by_name():
       return LicenseName(key="cc-by", family=_cc_fam())


   def _mit_version():
       return LicenseVersion(
           key="mit",
           url="https://opensource.org/licenses/MIT",
           license=LicenseName(key="mit", family=_osi_fam()),
       )


   class TestLicenseFamily:
       def test_str(self):
           assert str(LicenseFamily(key="cc")) == "cc"

       def test_repr(self):
           assert repr(LicenseFamily(key="osi")) == "LicenseFamily('osi')"

       def test_eq_same_type(self):
           assert LicenseFamily(key="cc") == LicenseFamily(key="cc")

       def test_eq_str(self):
           assert LicenseFamily(key="cc") == "cc"

       def test_neq(self):
           assert LicenseFamily(key="cc") != LicenseFamily(key="osi")

       def test_hash_usable_in_set(self):
           s = {LicenseFamily(key="cc"), LicenseFamily(key="cc"), LicenseFamily(key="osi")}
           assert len(s) == 2

       def test_frozen_prevents_mutation(self):
           fam = LicenseFamily(key="cc")
           with pytest.raises((AttributeError, TypeError)):
               fam.key = "other"  # type: ignore


   class TestLicenseName:
       def test_str(self):
           assert str(_cc_by_name()) == "cc-by"

       def test_frozen_prevents_mutation(self):
           name = _cc_by_name()
           with pytest.raises((AttributeError, TypeError)):
               name.key = "other"  # type: ignore

       def test_family_reference(self):
           assert _cc_by_name().family.key == "cc"


   class TestLicenseVersion:
       def test_str(self):
           assert str(_mit_version()) == "mit"

       def test_family_shortcut(self):
           assert _mit_version().family.key == "osi"

       def test_frozen_prevents_mutation(self):
           v = _mit_version()
           with pytest.raises((AttributeError, TypeError)):
               v.key = "other"  # type: ignore

       def test_url_stored(self):
           assert _mit_version().url == "https://opensource.org/licenses/MIT"

       def test_url_none(self):
           v = LicenseVersion(
               key="unknown",
               url=None,
               license=LicenseName(key="unknown", family=LicenseFamily(key="unknown")),
           )
           assert v.url is None
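
The properties tested above (immutability, equality with plain strings, use in sets, and a ``family`` shortcut on the version) can be reproduced with frozen dataclasses. This is a sketch under that assumption; the contents of ``_models.py`` are not shown in this section, and ``Family``, ``Name`` and ``Version`` are hypothetical names chosen to avoid implying they are the real classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, eq=False)
class Family:
    key: str

    def __str__(self) -> str:
        return self.key

    def __repr__(self) -> str:
        return f"Family({self.key!r})"

    def __eq__(self, other: object) -> bool:
        # Compare equal to another Family or to a bare string key.
        if isinstance(other, Family):
            return self.key == other.key
        if isinstance(other, str):
            return self.key == other
        return NotImplemented

    def __hash__(self) -> int:
        # Hash like the key so equal objects and strings hash alike.
        return hash(self.key)


@dataclass(frozen=True)
class Name:
    key: str
    family: Family

    def __str__(self) -> str:
        return self.key


@dataclass(frozen=True)
class Version:
    key: str
    url: Optional[str]
    license: Name

    def __str__(self) -> str:
        return self.key

    @property
    def family(self) -> Family:
        """Shortcut for version.license.family."""
        return self.license.family
```

Assigning to a field of a frozen dataclass raises ``dataclasses.FrozenInstanceError``, a subclass of ``AttributeError``, which is why the tests above accept ``AttributeError`` in ``pytest.raises``.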


src/licence_normaliser/tests/test_prose.py
==========================================

src/licence_normaliser/tests/test_prose.py

   """Tests for prose pattern matching via ProseParser."""

   from licence_normaliser import normalise_license


   class TestProsePatternMatching:
       def test_cc_by_nc_nd_4_0_prose(self):
           v = normalise_license("this work is licensed under cc by-nc-nd 4.0 terms")
           assert v.key == "cc-by-nc-nd-4.0"
           assert v.family.key == "cc"

       def test_cc_by_nc_nd_3_0_prose(self):
           v = normalise_license("license: cc by-nc-nd 3.0")
           assert v.key == "cc-by-nc-nd-3.0"
           assert v.family.key == "cc"

       def test_cc_by_nc_sa_creative_commons_prose(self):
           v = normalise_license("content licensed under creative commons by-nc-sa")
           assert v.key == "cc-by-nc-sa"
           assert v.family.key == "cc"

       def test_attribution_prose(self):
           v = normalise_license(
               "this content is made available under creative commons by license"
           )
           assert v.key == "cc-by"
           assert v.family.key == "cc"

       def test_attribution_noncommercial_prose(self):
           v = normalise_license(
               "this article is licensed under attribution noncommercial terms"
           )
           assert v.key == "cc-by-nc"
           assert v.family.key == "cc"

       def test_attribution_sharealike_prose(self):
           v = normalise_license("licensed under attribution share alike conditions")
           assert v.key == "cc-by-sa"
           assert v.family.key == "cc"

       def test_elsevier_tdm_prose(self):
           v = normalise_license(
               "this journal participates in text and data mining as "
               "permitted by the elsevier tdm agreement"
           )
           assert v.key == "elsevier-tdm"
           assert v.family.key == "publisher-tdm"

       def test_elsevier_user_license_prose(self):
           v = normalise_license(
               "elsevier user license applies to this open access article"
           )
           assert v.key == "elsevier-oa"
           assert v.family.key == "publisher-oa"

       def test_acs_authorchoice_prose(self):
           v = normalise_license("acs authorchoice option was selected by the authors")
           assert v.key == "acs-authorchoice"
           assert v.family.key == "publisher-oa"

       def test_all_rights_reserved_prose(self):
           v = normalise_license("all rights reserved except as permitted by law")
           assert v.key == "all-rights-reserved"
           assert v.family.key == "publisher-proprietary"

       def test_short_string_via_registry(self):
           v = normalise_license("cc by-nc-nd")
           assert v.key == "cc-by-nc-nd"
           assert v.family.key == "cc"

       def test_open_access_prose_matched(self):
           v = normalise_license("open access article available now")
           assert v.key == "other-oa"
           assert v.family.key == "other-oa"


src/licence_normaliser/tests/test_publisher.py
==============================================

src/licence_normaliser/tests/test_publisher.py

   """Tests for PublisherParser - publisher URLs and shorthand aliases."""

   from licence_normaliser import normalise_license


   class TestPublisherUrls:
       """Licence URLs resolve to their canonical key and family."""

       def test_elsevier_oa_url(self):
           v = normalise_license("https://www.elsevier.com/open-access/userlicense/1.0/")
           assert v.key == "elsevier-oa"
           assert v.family.key == "publisher-oa"

       def test_elsevier_oa_url_http(self):
           v = normalise_license("http://www.elsevier.com/open-access/userlicense/1.0/")
           assert v.key == "elsevier-oa"
           assert v.family.key == "publisher-oa"

       def test_elsevier_tdm_url(self):
           v = normalise_license("https://www.elsevier.com/tdm/userlicense/1.0/")
           assert v.key == "elsevier-tdm"
           assert v.family.key == "publisher-tdm"

       def test_wiley_tdm_url(self):
           v = normalise_license("http://doi.wiley.com/10.1002/tdm_license_1")
           assert v.key == "wiley-tdm"
           assert v.family.key == "publisher-tdm"

       def test_wiley_terms_url(self):
           v = normalise_license("https://onlinelibrary.wiley.com/terms-and-conditions")
           assert v.key == "wiley-terms"
           assert v.family.key == "publisher-proprietary"

       def test_springer_tdm_url(self):
           v = normalise_license("https://www.springer.com/tdm")
           assert v.key == "springer-tdm"
           assert v.family.key == "publisher-tdm"

       def test_springernature_tdm_url(self):
           v = normalise_license(
               "https://www.springernature.com/gp/researchers/text-and-data-mining"
           )
           assert v.key == "springernature-tdm"
           assert v.family.key == "publisher-tdm"

       def test_acs_authorchoice_ccby_url(self):
           v = normalise_license(
               "https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html"
           )
           assert v.key == "acs-authorchoice-ccby"
           assert v.family.key == "publisher-oa"

       def test_acs_authorchoice_url(self):
           v = normalise_license(
               "https://pubs.acs.org/page/policy/authorchoice_termsofuse.html"
           )
           assert v.key == "acs-authorchoice"
           assert v.family.key == "publisher-oa"

       def test_acs_authorchoice_nih_url(self):
           v = normalise_license(
               "https://pubs.acs.org/page/policy/"
               "acs_authorchoice_with_nih_addendum_termsofuse.html"
           )
           assert v.key == "acs-authorchoice-nih"
           assert v.family.key == "publisher-oa"

       def test_rsc_terms_url(self):
           v = normalise_license(
               "https://www.rsc.org/journals-books-databases/"
               "journal-authors-reviewers/licences-copyright-permissions/"
           )
           assert v.key == "rsc-terms"
           assert v.family.key == "publisher-proprietary"

       def test_iop_tdm_url(self):
           v = normalise_license(
               "https://iopscience.iop.org/info/page/text-and-data-mining"
           )
           assert v.key == "iop-tdm"
           assert v.family.key == "publisher-tdm"

       def test_bmj_copyright_url(self):
           v = normalise_license(
               "https://www.bmj.com/company/legal-stuff/copyright-notice/"
           )
           assert v.key == "bmj-copyright"
           assert v.family.key == "publisher-proprietary"

       def test_aaas_author_reuse_url(self):
           v = normalise_license(
               "https://www.science.org/content/page/science-licenses-journal-article-reuse"
           )
           assert v.key == "aaas-author-reuse"
           assert v.family.key == "publisher-proprietary"

       def test_aps_default_url(self):
           v = normalise_license("https://link.aps.org/licenses/aps-default-license")
           assert v.key == "aps-default"
           assert v.family.key == "publisher-proprietary"

       def test_aps_tdm_url(self):
           v = normalise_license(
               "https://link.aps.org/licenses/aps-default-text-mining-license"
           )
           assert v.key == "aps-tdm"
           assert v.family.key == "publisher-tdm"

       def test_cup_terms_url(self):
           v = normalise_license("https://www.cambridge.org/core/terms")
           assert v.key == "cup-terms"
           assert v.family.key == "publisher-proprietary"

       def test_aip_rights_url(self):
           v = normalise_license(
               "https://publishing.aip.org/authors/rights-and-permissions"
           )
           assert v.key == "aip-rights"
           assert v.family.key == "publisher-proprietary"

       def test_jama_cc_by_url(self):
           v = normalise_license("https://jamanetwork.com/pages/cc-by-license-permissions")
           assert v.key == "jama-cc-by"
           assert v.family.key == "publisher-oa"

       def test_oup_chorus_url(self):
           v = normalise_license(
               "https://academic.oup.com/journals/pages/open_access/"
               "funder_policies/chorus/standard_publication_model"
           )
           assert v.key == "oup-chorus"
           assert v.family.key == "publisher-oa"

       def test_oup_terms_url(self):
           v = normalise_license(
               "https://academic.oup.com/pages/standard-publication-reuse-rights"
           )
           assert v.key == "oup-terms"
           assert v.family.key == "publisher-proprietary"

       def test_sage_permissions_url(self):
           v = normalise_license("https://us.sagepub.com/en-us/nam/journals-permissions")
           assert v.key == "sage-permissions"
           assert v.family.key == "publisher-proprietary"

       def test_tandf_terms_url(self):
           v = normalise_license("https://www.tandfonline.com/action/showCopyRight")
           assert v.key == "tandf-terms"
           assert v.family.key == "publisher-proprietary"

       def test_gnu_gpl_url(self):
           v = normalise_license("https://www.gnu.org/licenses/gpl-3.0.html")
           assert v.key == "gpl-3.0"
           assert v.family.key == "copyleft"


   class TestPublisherShorthand:
       """Space-separated publisher shorthand resolves to the canonical key."""

       def test_elsevier_user_license(self):
           v = normalise_license("elsevier user license")
           assert v.key == "elsevier-oa"
           assert v.family.key == "publisher-oa"

       def test_elsevier_tdm_shorthand(self):
           v = normalise_license("elsevier tdm")
           assert v.key == "elsevier-tdm"
           assert v.family.key == "publisher-tdm"

       def test_wiley_tdm_shorthand(self):
           v = normalise_license("wiley tdm license")
           assert v.key == "wiley-tdm"
           assert v.family.key == "publisher-tdm"

       def test_wiley_vor(self):
           v = normalise_license("wiley vor")
           assert v.key == "wiley-vor"
           assert v.family.key == "publisher-proprietary"

       def test_wiley_am(self):
           v = normalise_license("wiley am")
           assert v.key == "wiley-am"
           assert v.family.key == "publisher-proprietary"

       def test_springer_tdm_shorthand(self):
           v = normalise_license("springer tdm")
           assert v.key == "springer-tdm"
           assert v.family.key == "publisher-tdm"

       def test_springer_nature_tdm_shorthand(self):
           v = normalise_license("springer nature tdm")
           assert v.key == "springernature-tdm"
           assert v.family.key == "publisher-tdm"

       def test_acs_authorchoice_shorthand(self):
           v = normalise_license("acs authorchoice")
           assert v.key == "acs-authorchoice"
           assert v.family.key == "publisher-oa"

       def test_acs_authorchoice_ccby_shorthand(self):
           v = normalise_license("acs authorchoice cc by")
           assert v.key == "acs-authorchoice-ccby"
           assert v.family.key == "publisher-oa"

       def test_acs_authorchoice_nih_shorthand(self):
           v = normalise_license("acs authorchoice nih")
           assert v.key == "acs-authorchoice-nih"
           assert v.family.key == "publisher-oa"

       def test_rsc_terms_shorthand(self):
           v = normalise_license("rsc terms")
           assert v.key == "rsc-terms"
           assert v.family.key == "publisher-proprietary"

       def test_iop_tdm_shorthand(self):
           v = normalise_license("iop tdm")
           assert v.key == "iop-tdm"
           assert v.family.key == "publisher-tdm"

       def test_iop_copyright_shorthand(self):
           v = normalise_license("iop copyright")
           assert v.key == "iop-copyright"
           assert v.family.key == "publisher-proprietary"

       def test_bmj_copyright_shorthand(self):
           v = normalise_license("bmj copyright")
           assert v.key == "bmj-copyright"
           assert v.family.key == "publisher-proprietary"

       def test_aaas_author_reuse_shorthand(self):
           v = normalise_license("aaas author reuse")
           assert v.key == "aaas-author-reuse"
           assert v.family.key == "publisher-proprietary"

       def test_pnas_licenses_shorthand(self):
           v = normalise_license("pnas licenses")
           assert v.key == "pnas-licenses"
           assert v.family.key == "publisher-proprietary"

       def test_aps_default_shorthand(self):
           v = normalise_license("aps default")
           assert v.key == "aps-default"
           assert v.family.key == "publisher-proprietary"

       def test_aps_tdm_shorthand(self):
           v = normalise_license("aps tdm")
           assert v.key == "aps-tdm"
           assert v.family.key == "publisher-tdm"

       def test_cup_terms_shorthand(self):
           v = normalise_license("cup terms")
           assert v.key == "cup-terms"
           assert v.family.key == "publisher-proprietary"

       def test_aip_rights_shorthand(self):
           v = normalise_license("aip rights")
           assert v.key == "aip-rights"
           assert v.family.key == "publisher-proprietary"

       def test_jama_cc_by_shorthand(self):
           v = normalise_license("jama cc by")
           assert v.key == "jama-cc-by"
           assert v.family.key == "publisher-oa"

       def test_degruyter_terms_shorthand(self):
           v = normalise_license("degruyter terms")
           assert v.key == "degruyter-terms"
           assert v.family.key == "publisher-proprietary"

       def test_oup_chorus_shorthand(self):
           v = normalise_license("oup chorus")
           assert v.key == "oup-chorus"
           assert v.family.key == "publisher-oa"

       def test_oup_terms_shorthand(self):
           v = normalise_license("oup terms")
           assert v.key == "oup-terms"
           assert v.family.key == "publisher-proprietary"

       def test_sage_permissions_shorthand(self):
           v = normalise_license("sage permissions")
           assert v.key == "sage-permissions"
           assert v.family.key == "publisher-proprietary"

       def test_tandf_terms_shorthand(self):
           v = normalise_license("tandf terms")
           assert v.key == "tandf-terms"
           assert v.family.key == "publisher-proprietary"

       def test_thieme_nlm_shorthand(self):
           v = normalise_license("thieme nlm")
           assert v.key == "thieme-nlm"
           assert v.family.key == "publisher-oa"


   class TestPublisherDirectKeys:
       """Hyphenated keys passed in directly resolve to a canonical key and family."""

       def test_elsevier_tdm_key(self):
           v = normalise_license("elsevier-tdm")
           assert v.key == "elsevier-tdm"
           assert v.family.key == "publisher-tdm"

       def test_elsevier_oa_key(self):
           v = normalise_license("elsevier-oa")
           assert v.key == "elsevier-oa"
           assert v.family.key == "publisher-oa"

       def test_wiley_tdm_key(self):
           v = normalise_license("wiley-tdm")
           assert v.key == "wiley-tdm"
           assert v.family.key == "publisher-tdm"

       def test_acs_authorchoice_key(self):
           v = normalise_license("acs-authorchoice")
           assert v.key == "acs-authorchoice"
           assert v.family.key == "publisher-oa"

       def test_acs_authorchoice_ccby_key(self):
           v = normalise_license("acs-authorchoice-ccby")
           assert v.key == "acs-authorchoice-ccby"
           assert v.family.key == "publisher-oa"

       def test_acs_authorchoice_nih_key(self):
           v = normalise_license("acs-authorchoice-nih")
           assert v.key == "acs-authorchoice-nih"
           assert v.family.key == "publisher-oa"

       def test_iop_tdm_key(self):
           v = normalise_license("iop-tdm")
           assert v.key == "iop-tdm"
           assert v.family.key == "publisher-tdm"

       def test_aps_tdm_key(self):
           v = normalise_license("aps-tdm")
           assert v.key == "aps-tdm"
           assert v.family.key == "publisher-tdm"

       def test_oup_chorus_key(self):
           v = normalise_license("oup-chorus")
           assert v.key == "oup-chorus"
           assert v.family.key == "publisher-oa"

       def test_jama_cc_by_key(self):
           v = normalise_license("jama-cc-by")
           assert v.key == "jama-cc-by"
           assert v.family.key == "publisher-oa"

       def test_thieme_nlm_key(self):
           v = normalise_license("thieme-nlm")
           assert v.key == "thieme-nlm"
           assert v.family.key == "publisher-oa"

       def test_implied_oa_key(self):
           v = normalise_license("implied-oa")
           assert v.key == "implied-oa"
           assert v.family.key == "publisher-oa"

       def test_unspecified_oa_key(self):
           v = normalise_license("unspecified-oa")
           assert v.key == "unspecified-oa"
           assert v.family.key == "other-oa"

       def test_author_manuscript_key(self):
           v = normalise_license("author-manuscript")
           assert v.key == "author-manuscript"
           assert v.family.key == "publisher-oa"

       def test_all_rights_reserved_key(self):
           v = normalise_license("all-rights-reserved")
           assert v.key == "all-rights-reserved"
           assert v.family.key == "publisher-proprietary"

       def test_no_reuse_key(self):
           v = normalise_license("no-reuse")
           assert v.key == "no-reuse"
           assert v.family.key == "publisher-proprietary"

       def test_other_oa_key(self):
           v = normalise_license("other-oa")
           assert v.key == "other-oa"
           assert v.family.key == "other-oa"

       def test_public_domain_key(self):
           v = normalise_license("public-domain")
           assert v.key == "public-domain"
           assert v.family.key == "public-domain"

       def test_open_access_key(self):
           # "open-access" is an alias, not a canonical key: it maps to "other-oa".
           v = normalise_license("open-access")
           assert v.key == "other-oa"
           assert v.family.key == "other-oa"


   class TestPublisherCatchAll:
       """Generic labels not tied to one publisher resolve to catch-all keys."""

       def test_implied_oa_shorthand(self):
           v = normalise_license("implied oa")
           assert v.key == "implied-oa"
           assert v.family.key == "publisher-oa"

       def test_unspecified_oa_shorthand(self):
           v = normalise_license("unspecified oa")
           assert v.key == "unspecified-oa"
           assert v.family.key == "other-oa"

       def test_open_access_shorthand(self):
           v = normalise_license("open access")
           assert v.key == "other-oa"
           assert v.family.key == "other-oa"

       def test_author_manuscript_shorthand(self):
           v = normalise_license("author manuscript")
           assert v.key == "author-manuscript"
           assert v.family.key == "publisher-oa"

       def test_all_rights_reserved_shorthand(self):
           v = normalise_license("all rights reserved")
           assert v.key == "all-rights-reserved"
           assert v.family.key == "publisher-proprietary"

       def test_no_reuse_shorthand(self):
           v = normalise_license("no reuse")
           assert v.key == "no-reuse"
           assert v.family.key == "publisher-proprietary"


   class TestCCPublicDomain:
       """CC0, Public Domain Mark, and generic public-domain inputs."""

       def test_cc_pdm_bare_key(self):
           v = normalise_license("cc-pdm")
           assert v.key == "cc-pdm-1.0"
           assert v.family.key == "public-domain"

       def test_cc_pdm_versioned_key(self):
           v = normalise_license("cc-pdm-1.0")
           assert v.key == "cc-pdm-1.0"
           assert v.family.key == "public-domain"

       def test_cc0_bare_key(self):
           v = normalise_license("cc0")
           assert v.key == "cc0-1.0"
           assert v.family.key == "cc0"

       def test_cc0_versioned_key(self):
           v = normalise_license("cc0-1.0")
           assert v.key == "cc0-1.0"
           assert v.family.key == "cc0"

       def test_cc_zero_shorthand(self):
           v = normalise_license("cc-zero")
           assert v.key == "cc0-1.0"
           assert v.family.key == "cc0"

       def test_public_domain_fallback(self):
           v = normalise_license("public-domain")
           assert v.key == "public-domain"
           assert v.family.key == "public-domain"

       def test_creative_commons_zero(self):
           v = normalise_license("creative commons zero")
           assert v.key == "cc0-1.0"
           assert v.family.key == "cc0"

       def test_creative_commons_public_domain(self):
           v = normalise_license("creative commons public domain")
           assert v.key == "cc-pdm-1.0"
           assert v.family.key == "public-domain"