Project source-tree

Below is the layout of the project (to 10 levels), followed by the contents of each key file.

Project directory layout

licence-normaliser/
├── scripts
│   ├── __init__.py
│   ├── check_missing_aliases.py
│   ├── compare_datasets.py
│   ├── README.rst
│   └── test_name_inference.py
├── src
│   └── licence_normaliser
│       ├── cli
│       │   ├── __init__.py
│       │   └── _main.py
│       ├── data
│       │   ├── aliases
│       │   │   └── aliases.json
│       │   ├── prose
│       │   │   └── prose_patterns.json
│       │   ├── publishers
│       │   │   └── publishers.json
│       │   ├── urls
│       │   │   └── url_map.json
│       │   └── README.rst
│       ├── parsers
│       │   ├── __init__.py
│       │   ├── alias.py
│       │   ├── creativecommons.py
│       │   ├── opendefinition.py
│       │   ├── osi.py
│       │   ├── prose.py
│       │   ├── publisher.py
│       │   ├── scancode_licensedb.py
│       │   └── spdx.py
│       ├── tests
│       │   ├── __init__.py
│       │   ├── conftest.py
│       │   ├── test_aliases.py
│       │   ├── test_cache.py
│       │   ├── test_cli.py
│       │   ├── test_core.py
│       │   ├── test_exceptions.py
│       │   ├── test_integration.py
│       │   ├── test_models.py
│       │   ├── test_prose.py
│       │   └── test_publisher.py
│       ├── __init__.py
│       ├── _cache.py
│       ├── _core.py
│       ├── _models.py
│       ├── _normaliser.py
│       ├── _trace.py
│       ├── defaults.py
│       ├── exceptions.py
│       ├── plugins.py
│       └── py.typed
├── AGENTS.md
├── conftest.py
├── CONTRIBUTING.rst
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── pyproject.toml
├── README.rst
└── tox.ini

README.rst

==================
licence-normaliser
==================

.. image:: https://raw.githubusercontent.com/barseghyanartur/licence-normaliser/main/docs/_static/licence_normaliser_logo.webp
   :alt: licence-normaliser logo
   :align: center

Comprehensive license normalisation with a three-level hierarchy.

.. image:: https://img.shields.io/pypi/v/licence-normaliser.svg
   :target: https://pypi.python.org/pypi/licence-normaliser
   :alt: PyPI Version

.. image:: https://img.shields.io/pypi/pyversions/licence-normaliser.svg
   :target: https://pypi.python.org/pypi/licence-normaliser/
   :alt: Supported Python versions

.. image:: https://github.com/barseghyanartur/licence-normaliser/actions/workflows/test.yml/badge.svg?branch=main
   :target: https://github.com/barseghyanartur/licence-normaliser/actions
   :alt: Build Status

.. image:: https://readthedocs.org/projects/licence-normaliser/badge/?version=latest
    :target: http://licence-normaliser.readthedocs.io
    :alt: Documentation Status

.. image:: https://img.shields.io/badge/docs-llms.txt-blue
    :target: https://licence-normaliser.readthedocs.io/en/latest/llms.txt
    :alt: llms.txt - documentation for LLMs

.. image:: https://img.shields.io/badge/license-MIT-blue.svg
   :target: https://github.com/barseghyanartur/licence-normaliser/#License
   :alt: MIT

.. image:: https://coveralls.io/repos/github/barseghyanartur/licence-normaliser/badge.svg?branch=main&service=github
    :target: https://coveralls.io/github/barseghyanartur/licence-normaliser?branch=main
    :alt: Coverage

``licence-normaliser`` is a comprehensive license normalisation library that
maps any license representation (SPDX tokens, URLs, prose descriptions) to a
canonical three-level hierarchy.

Features
========

- **Three-level hierarchy** - LicenseFamily → LicenseName → LicenseVersion.
- **Wide format support** - SPDX tokens, URLs, prose descriptions.
- **Creative Commons support** - Full CC family with versions and IGO variants.
- **Publisher-specific licenses** - Springer, Nature, Elsevier, Wiley, ACS,
  and more.
- **File-driven data** - Add aliases, URLs, and patterns by editing JSON files.
  No Python code changes required for new synonyms.
- **Pluggable parsers** - Drop in a new parser class to ingest
  any external license registry. Parsers implement plugin interfaces
  (``RegistryPlugin``, ``URLPlugin``, etc.).
- **Strict mode** - Raise ``LicenseNotFoundError`` instead of silently
  returning ``"unknown"``.
- **Caching** - LRU caching for performance.
- **CLI** - Command-line interface with ``--strict`` and ``--explain`` support.

Hierarchy
=========

The library uses a three-level hierarchy:

1. **LicenseFamily** - broad bucket: ``"cc"``, ``"osi"``, ``"copyleft"``,
   ``"publisher-tdm"``, ...
2. **LicenseName** - version-free: ``"cc-by"``, ``"cc-by-nc-nd"``, ``"mit"``,
   ``"wiley-tdm"``
3. **LicenseVersion** - fully resolved: ``"cc-by-3.0"``, ``"cc-by-nc-nd-4.0"``

Installation
============

With ``uv``:

.. code-block:: sh

    uv pip install licence-normaliser

Or with ``pip``:

.. code-block:: sh

    pip install licence-normaliser

Quick start
===========

.. code-block:: python
    :name: test_quick_start

    from licence_normaliser import normalise_license

    v = normalise_license("CC BY-NC-ND 4.0")
    str(v)                  # "cc-by-nc-nd-4.0"   ← LicenseVersion
    str(v.license)          # "cc-by-nc-nd"       ← LicenseName
    str(v.license.family)   # "cc"                ← LicenseFamily

Strict mode
===========

By default, unresolvable inputs return an ``"unknown"`` result.  Pass
``strict=True`` to raise ``LicenseNotFoundError`` instead:

.. code-block:: python
    :name: test_strict_mode

    from licence_normaliser import normalise_license
    from licence_normaliser.exceptions import LicenseNotFoundError

    # Silent fallback (default)
    v = normalise_license("some-unknown-string")
    v.family.key  # "unknown"

    # Strict: raises on unresolvable input
    try:
        v = normalise_license("some-unknown-string", strict=True)
    except LicenseNotFoundError as exc:
        print(exc.raw)      # original input
        print(exc.cleaned)  # cleaned form that failed lookup

Trace / Explain
===============

Set ``ENABLE_LICENCE_NORMALISER_TRACE=1`` or pass ``trace=True`` to get
resolution traces showing how the license was matched:

.. code-block:: python
    :name: test_trace

    from licence_normaliser import normalise_license

    # Via function
    v = normalise_license("cc by-nc-nd 3.0 igo", trace=True)
    print(v.explain())

    # Via class
    from licence_normaliser import LicenseNormaliser
    ln = LicenseNormaliser(trace=True)
    v = ln.normalise_license("MIT")
    print(v.explain())

Output shows the resolution pipeline (alias → registry → url → prose →
fallback) and which source file + line matched:

.. code-block:: text

    Input: 'cc by-nc-nd 3.0 igo' → 'cc by-nc-nd 3.0 igo'
      [✓] alias: 'cc by-nc-nd 3.0 igo' → 'cc-by-nc-nd-3.0-igo' (line 139 in aliases.json)

    Result:
      version_key: 'cc-by-nc-nd-3.0-igo'
      name_key: 'cc-by-nc-nd'
      family_key: 'cc'

The trace can also be accessed via ``v._trace`` for programmatic use.

Batch normalisation
===================

.. code-block:: python
    :name: test_batch_normalisation

    from licence_normaliser import normalise_licenses

    results = normalise_licenses(["MIT", "Apache-2.0", "CC BY 4.0"])
    for r in results:
        print(r.key)

    # Strict batch - raises on first unresolvable
    results = normalise_licenses(["MIT", "Apache-2.0"], strict=True)

Custom plugins
==============

The ``LicenseNormaliser`` class lets you inject custom plugin classes for
specialised use cases:

.. code-block:: python
    :name: test_custom_plugins

    from licence_normaliser import LicenseNormaliser
    from licence_normaliser.parsers.alias import AliasParser
    from licence_normaliser.parsers.spdx import SPDXParser

    # Use only SPDX + Alias plugins (no CC, no publisher URLs)
    ln = LicenseNormaliser(
        registry=[SPDXParser],
        alias=[AliasParser],
        family=[AliasParser],
        name=[AliasParser],
        cache=True,
        cache_maxsize=8192,
    )

    # MIT resolves via SPDX parser
    assert str(ln.normalise_license("MIT")) == "mit"

    # CC BY resolves via Alias
    assert str(ln.normalise_license("CC BY-NC-ND 4.0")) == "cc-by-nc-nd-4.0"

.. note::

    Explicit plugin passing is optional: ``LicenseNormaliser()``
    automatically loads defaults. Use the pattern above only if you need
    custom plugins or want to reduce the number of plugins loaded.

For caching, ``LicenseNormaliser`` wraps the resolution method
with ``lru_cache``.
Disable it by passing ``cache=False`` for debugging:

.. code-block:: python
    :name: test_caching

    from licence_normaliser import LicenseNormaliser

    ln = LicenseNormaliser(cache=False)
    result = ln.normalise_license("MIT")

Update data (CLI)
=================

.. code-block:: sh

    licence-normaliser update-data --force
    # Fetches fresh SPDX, OpenDefinition, OSI, CreativeCommons, and ScanCode JSONs

Integration tests (public API only)
===================================

All integration tests live in
``src/licence_normaliser/tests/test_integration.py``
and only import the public API.

CLI usage
=========

Normalise a single license:

.. code-block:: sh

    licence-normaliser normalise "MIT"
    # Output: mit

    licence-normaliser normalise --full "CC BY 4.0"
    # Output:
    # Key: cc-by-4.0
    # URL: https://creativecommons.org/licenses/by/4.0/
    # License: cc-by
    # Family: cc

    licence-normaliser normalise --strict "totally-unknown"
    # Exits with code 1 and prints an error

Batch normalise:

.. code-block:: sh

    licence-normaliser batch MIT "Apache-2.0" "CC BY 4.0"
    licence-normaliser batch --strict MIT "Apache-2.0"

Exceptions
==========

.. code-block:: python
    :name: test_exceptions

    from licence_normaliser.exceptions import (
        LicenseNormaliserError,   # base class
        LicenseNotFoundError,     # raised by strict mode
    )

Testing
=======

All tests run inside Docker:

.. code-block:: sh

    make test

To test a specific Python version:

.. code-block:: sh

    make test-env ENV=py312

License
=======

MIT

Author
======

Artur Barseghyan <artur.barseghyan@gmail.com>

CONTRIBUTING.rst

======================
Contributor guidelines
======================

.. _licence-normaliser: https://github.com/barseghyanartur/licence-normaliser/
.. _uv: https://docs.astral.sh/uv/
.. _tox: https://tox.wiki
.. _ruff: https://beta.ruff.rs/docs/
.. _doc8: https://doc8.readthedocs.io/
.. _pre-commit: https://pre-commit.com/#installation
.. _issues: https://github.com/barseghyanartur/licence-normaliser/issues
.. _discussions: https://github.com/barseghyanartur/licence-normaliser/discussions
.. _pull request: https://github.com/barseghyanartur/licence-normaliser/pulls
.. _versions manifest: https://github.com/actions/python-versions/blob/main/versions-manifest.json

Developer prerequisites
-----------------------

pre-commit
~~~~~~~~~~

Refer to `pre-commit`_ for installation instructions.

TL;DR:

.. code-block:: sh

    curl -LsSf https://astral.sh/uv/install.sh | sh  # Install uv
    uv tool install pre-commit                        # Install pre-commit
    pre-commit install                                # Install hooks

Installing `pre-commit`_ ensures all contributions adhere to the project's
code quality standards.

Code standards
--------------

`ruff`_ and `doc8`_ are triggered automatically by `pre-commit`_.

To run checks manually:

.. code-block:: sh

    make doc8
    make ruff

Import conventions
~~~~~~~~~~~~~~~~~~

**Import statements belong at module level.** Avoid placing imports inside
functions or methods unless absolutely necessary:

- **Acceptable exceptions:**

  - Breaking circular dependencies
  - Optional runtime dependencies (e.g., CLI-only imports)
  - Heavy imports that are rarely used

- **Why this matters:**

  - Improves code readability
  - Makes dependencies explicit and discoverable
  - Enables static analysis tools to work correctly
  - Follows Python community best practices (PEP 8)

When in doubt, place imports at the top of the file.

Virtual environment
-------------------

.. code-block:: sh

    make create-venv

Installation
------------

.. code-block:: sh

    make install

Testing
-------

.. note::
   Python 3.15 is being tested on GitHub CI, but not inside a local Docker image.

Docker-based testing (recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All tests run inside Docker for platform independence and consistency:

.. code-block:: sh

    make test                    # full matrix (Python 3.10-3.14)
    make test-env ENV=py312      # single Python version
    make shell                   # interactive shell in test container
    make shell-env ENV=py312     # interactive shell for specific Python

Local testing (alternative)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For faster iteration during development, you can run tests locally with ``uv``:

.. code-block:: sh

    make install                 # one-time setup
    uv run pytest                # run all tests
    uv run pytest path/to/test_something.py  # run specific test

**Important**: If you encounter tooling errors with local testing, fall back to
Docker-based testing, which is the canonical environment.

GitHub Actions
~~~~~~~~~~~~~~

In any case, GitHub Actions runs the full matrix automatically on every push.
Tests run on Python 3.10–3.15 (all non-EOL versions).  See the
`versions manifest`_ for the full list of available Python versions.

Adding new normalisation rules
------------------------------

For a new **alias** or **family override** for an *existing* license:

1. Add an entry to ``src/licence_normaliser/data/aliases/aliases.json``.
2. Optionally, add an ``aliases`` array to define additional lookup variants
   (e.g. hyphen vs space forms) that resolve to the same target:

   .. code-block:: json

       {
         "cc by-nc": {
           "version_key": "cc-by-nc",
           "name_key": "cc-by-nc",
           "family_key": "cc",
           "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"]
         }
       }

3. Add a test in ``src/licence_normaliser/tests/test_aliases.py`` or
   ``test_alias_expansion.py``.
4. No Python changes needed.
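
As a rough illustration of what the ``aliases`` array means, the sketch below
flattens an entry shaped like the JSON above into a lookup table in which
every variant points at the same target. The ``expand_aliases`` helper is
hypothetical and is not the library's loader; the real expansion logic may
differ.

.. code-block:: python

    import json

    RAW = """
    {
      "cc by-nc": {
        "version_key": "cc-by-nc",
        "name_key": "cc-by-nc",
        "family_key": "cc",
        "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"]
      }
    }
    """


    def expand_aliases(entries: dict[str, dict]) -> dict[str, dict[str, str]]:
        """Map the primary key and each listed variant to the same target."""
        table: dict[str, dict[str, str]] = {}
        for primary, entry in entries.items():
            # The target is everything except the ``aliases`` list itself.
            target = {k: v for k, v in entry.items() if k != "aliases"}
            for variant in [primary, *entry.get("aliases", [])]:
                table[variant] = target
        return table


    table = expand_aliases(json.loads(RAW))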

For a new **prose pattern** (regex matching free-text descriptions):

1. Add an entry to ``src/licence_normaliser/data/prose/prose_patterns.json``.
2. Add a test in ``src/licence_normaliser/tests/test_prose.py``.
3. No Python changes needed.

For a new **URL mapping**:

1. Add an entry to ``src/licence_normaliser/data/urls/url_map.json`` or
   ``src/licence_normaliser/data/publishers/publishers.json``.
2. Add a test in ``src/licence_normaliser/tests/test_publisher.py``.
3. No Python changes needed.

For a **brand-new license key** (SPDX, OpenDefinition, OSI, CC, or ScanCode):

1. The upstream data source must be updated first
   (``licence-normaliser update-data --force`` for SPDX/OpenDefinition, or
   edit the upstream source for OSI/CC/ScanCode).
2. The parser will pick it up automatically on the next import.
3. Add an alias in ``aliases.json`` if needed.
4. Add a family override in ``aliases.json`` if needed.
5. Add tests.

For a **new parser** (new upstream data source):

1. Create ``src/licence_normaliser/parsers/my_parser.py`` implementing
   ``BasePlugin``.
2. Register it in ``src/licence_normaliser/parsers/__init__.py``.
3. Set ``is_registry_entry = False`` if the parser only contributes
   aliases/URLs/patterns (not new license keys).
4. Add tests.


Releases
--------

**Build the package for releasing:**

.. code-block:: sh

    make package-build

----

**Test the built package:**

.. code-block:: sh

    make check-package-build

----

**Make a test release (test.pypi.org):**

.. code-block:: sh

    make test-release

----

**Release (pypi.org):**

.. code-block:: sh

    make release

Adding tests
------------

- Every new normalisation rule must have a corresponding test.
- Tests should cover both successful normalisation and edge cases.

Pull requests
-------------

Open a `pull request`_ to the ``dev`` branch only. Never directly to ``main``.

.. note::

    Create pull requests to the ``dev`` branch only!

Examples of welcome contributions:

- Fixing documentation typos or improving explanations.
- Adding test cases for new edge cases.
- Extending support for additional license formats.
- Improving error messages.

General checklist
~~~~~~~~~~~~~~~~~

- Does your change require documentation updates (``README.rst``,
  ``AGENTS.md``, ``ARCHITECTURE.rst``, ``CONTRIBUTING.rst``)?
- Does your change require new tests?
- Does your change add any external dependencies?
  If so, reconsider: ``licence-normaliser`` should have minimal dependencies.

When fixing bugs
~~~~~~~~~~~~~~~~

- Add a regression test that reproduces the bug before your fix.

When adding a new feature
~~~~~~~~~~~~~~~~~~~~~~~~~

- Update ``README.rst``, ``AGENTS.md``, and ``ARCHITECTURE.rst`` if applicable.
- Add appropriate tests.

Questions
---------

Ask on GitHub `discussions`_.

Issues
------

Report bugs or request features on GitHub `issues`_.

AGENTS.md

# AGENTS.md - licence-normaliser

**Repository**: https://github.com/barseghyanartur/licence-normaliser
**Maintainer**: Artur Barseghyan <artur.barseghyan@gmail.com>

---

## 1. Project Mission (Never Deviate)

> Comprehensive license normalisation with a three-level hierarchy - secure,
> fast, and extensible.

- Maps any license representation to a canonical three-level hierarchy
- Supports SPDX tokens, URLs, prose descriptions
- No external dependencies (only optional dev/test deps)
- LRU caching for performance
- Data-file-driven: parsers load from package data JSON files
- `licence-normaliser update-data` CLI command to refresh SPDX + OpenDefinition data

---

## 2. Architecture

### Three-Level Hierarchy

| Level | Class | Example |
| ----- | ----- | ------- |
| **Family** | `LicenseFamily` | `"cc"`, `"osi"`, `"copyleft"`, `"data"` |
| **Name** | `LicenseName` | `"cc-by"`, `"mit"`, `"gpl-3.0-only"` |
| **Version** | `LicenseVersion` | `"cc-by-4.0"`, `"mit"`, `"gpl-3.0-only"` |
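
To make the table concrete, here is a minimal sketch of three nested frozen dataclasses. The `Family`, `Name` and `Version` classes are hypothetical stand-ins, not the classes defined in `_models.py`, whose fields may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Family:
    """Hypothetical stand-in for the broadest level, e.g. "cc"."""

    key: str

    def __str__(self) -> str:
        return self.key


@dataclass(frozen=True)
class Name:
    """Hypothetical stand-in for the version-free level, e.g. "cc-by"."""

    key: str
    family: Family

    def __str__(self) -> str:
        return self.key


@dataclass(frozen=True)
class Version:
    """Hypothetical stand-in for the fully resolved level."""

    key: str
    license: Name

    @property
    def family(self) -> Family:
        # Shortcut so callers can write v.family as well as v.license.family.
        return self.license.family

    def __str__(self) -> str:
        return self.key


cc = Family("cc")
cc_by_nc_nd = Name("cc-by-nc-nd", cc)
v = Version("cc-by-nc-nd-4.0", cc_by_nc_nd)
```

Freezing each level keeps the objects hashable, which is convenient when results are cached or used as dictionary keys.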

### Resolution Pipeline

1. **Alias table** - cleaned lowercase key matches `ALIASES` (loaded from `data/aliases/aliases.json`)
2. **Direct registry lookup** - hit in `REGISTRY` (SPDX, OpenDefinition, OSI, CC, ScanCode license keys)
3. **URL map** - hit in `URL_MAP` (loaded from SPDX + OpenDefinition + publisher data)
4. **Prose pattern scan** - regex patterns from `data/prose/prose_patterns.json` (for strings >20 chars)
5. **Fallback** - key = cleaned string, family = unknown
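
The ordering above can be sketched with small in-memory tables. Everything here is a simplified assumption: the table contents, the `clean` and `resolve` helpers and the lowercase/whitespace cleaning rule are hypothetical, and the real pipeline in `_normaliser.py` may differ.

```python
import re

# Hypothetical miniature tables; the real ones are loaded from JSON data files.
ALIASES = {"cc by 4.0": "cc-by-4.0"}
REGISTRY = {"mit", "cc-by-4.0"}
URL_MAP = {"https://creativecommons.org/licenses/by/4.0/": "cc-by-4.0"}
PROSE = [(re.compile(r"creative commons attribution 4\.0"), "cc-by-4.0")]


def clean(raw: str) -> str:
    """Assumed cleaning rule: lowercase and collapse whitespace."""
    return " ".join(raw.lower().split())


def resolve(raw: str) -> tuple[str, str]:
    """Return (key, stage), trying each stage in the documented order."""
    cleaned = clean(raw)
    if cleaned in ALIASES:
        return ALIASES[cleaned], "alias"
    if cleaned in REGISTRY:
        return cleaned, "registry"
    if cleaned in URL_MAP:
        return URL_MAP[cleaned], "url"
    if len(cleaned) > 20:  # prose scan only for strings longer than 20 chars
        for pattern, key in PROSE:
            if pattern.search(cleaned):
                return key, "prose"
    # Nothing matched: the cleaned string becomes the key (family unknown).
    return cleaned, "fallback"
```

Because the alias table is consulted first, a curated alias takes precedence over a registry key with the same spelling in this sketch.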

### Key Files

| File | Purpose |
| ---- | ------- |
| `src/licence_normaliser/_models.py` | Frozen dataclass hierarchy |
| `src/licence_normaliser/_normaliser.py` | `LicenseNormaliser` class with plugin-based resolution |
| `src/licence_normaliser/plugins.py` | Plugin interfaces (BasePlugin, RegistryPlugin, URLPlugin, etc.) |
| `src/licence_normaliser/defaults.py` | Lazy-loading default plugin bundle |
| `src/licence_normaliser/_cache.py` | Module-level API delegating to `LicenseNormaliser` |
| `src/licence_normaliser/parsers/` | Parser classes implementing plugin interfaces |
| `src/licence_normaliser/cli/_main.py` | CLI with normalise, batch, update-data |
| `src/licence_normaliser/exceptions.py` | `LicenseNormaliserError` (base class), `LicenseNotFoundError` |
| `src/licence_normaliser/data/spdx/spdx.json` | **DO NOT MODIFY** Full SPDX license list (loaded at runtime) |
| `src/licence_normaliser/data/opendefinition/opendefinition.json` | **DO NOT MODIFY** Full OpenDefinition list (loaded at runtime) |
| `src/licence_normaliser/data/aliases/aliases.json` | Curated aliases with rich metadata |
| `src/licence_normaliser/data/prose/prose_patterns.json` | Curated prose regex patterns |
| `src/licence_normaliser/data/publishers/publishers.json` | Publisher URLs and shorthand aliases |

---

## 3. Using licence-normaliser in Application Code

### Simple case

```python name=test_simple_case
from licence_normaliser import normalise_license

v = normalise_license("MIT")
str(v)  # "mit"
```

### With full hierarchy

<!-- continue: test_simple_case -->
```python name=test_full_hierarchy
v = normalise_license("CC BY-NC-ND 4.0")
print(v.key)           # "cc-by-nc-nd-4.0"
print(v.license.key)   # "cc-by-nc-nd"
print(v.family.key)    # "cc"
```

### Strict mode

```python name=test_strict_mode
import pytest
from licence_normaliser import normalise_license, LicenseNotFoundError

# Would normally raise: License not found: 'unknown string'
with pytest.raises(LicenseNotFoundError):
    v = normalise_license("unknown string", strict=True)

# Batch strict
from licence_normaliser import normalise_licenses

with pytest.raises(LicenseNotFoundError):
    results = normalise_licenses(
        ["unknown string", "unknown string 2.0"],
        strict=True,
    )
```

### Custom plugins with LicenseNormaliser

The `LicenseNormaliser` class lets you inject custom plugin classes for
specialised use cases:

```python name=test_custom_plugins
from licence_normaliser import LicenseNormaliser
from licence_normaliser.parsers.spdx import SPDXParser
from licence_normaliser.parsers.alias import AliasParser

# Use only SPDX + Alias plugins (no CC, no publisher URLs)
ln = LicenseNormaliser(
    registry=[SPDXParser],
    alias=[AliasParser],
    family=[AliasParser],
    name=[AliasParser],
)

# MIT resolves via SPDX parser
assert str(ln.normalise_license("MIT")) == "mit"

# CC BY resolves via Alias
assert str(ln.normalise_license("CC BY-NC-ND 4.0")) == "cc-by-nc-nd-4.0"
```

To use all defaults, import from `defaults`:

```python name=test_defaults_usage
from licence_normaliser import LicenseNormaliser
from licence_normaliser.defaults import (
    get_default_registry,
    get_default_url,
    get_default_alias,
    get_default_family,
    get_default_name,
    get_default_prose,
)

ln = LicenseNormaliser(
    registry=get_default_registry(),
    url=get_default_url(),
    alias=get_default_alias(),
    family=get_default_family(),
    name=get_default_name(),
    prose=get_default_prose(),
    cache=True,
    cache_maxsize=8192,
)
result = ln.normalise_license("MIT")
```

> [!NOTE]
> Explicit plugin passing is optional — `LicenseNormaliser()` automatically
> loads defaults. Use the pattern above only if you need custom plugins.

For caching, `LicenseNormaliser` wraps the resolution method with `lru_cache`.
Disable it by passing `cache=False` for debugging:

```python name=test_caching
from licence_normaliser import LicenseNormaliser

ln = LicenseNormaliser(cache=False)
result = ln.normalise_license("MIT")
```
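
As a rough picture of what wrapping the resolution method with `lru_cache` means, the sketch below builds a per-instance cache in `__init__`. The `TinyNormaliser` class, its `_resolve` method and the default `cache_maxsize` value are hypothetical; only the `cache` and `cache_maxsize` parameter names are taken from the examples above, and the library's actual wiring may differ.

```python
from functools import lru_cache


class TinyNormaliser:
    """Hypothetical stand-in showing optional per-instance LRU caching."""

    def __init__(self, cache: bool = True, cache_maxsize: int = 128) -> None:
        self.calls = 0  # counts how often the uncached path actually runs
        if cache:
            # Wrap the bound method so each instance owns its own cache.
            self.normalise = lru_cache(maxsize=cache_maxsize)(self._resolve)
        else:
            self.normalise = self._resolve

    def _resolve(self, raw: str) -> str:
        self.calls += 1
        return raw.strip().lower()


cached = TinyNormaliser(cache=True)
cached.normalise("MIT")
cached.normalise("MIT")  # served from the cache, _resolve is not called again

uncached = TinyNormaliser(cache=False)
uncached.normalise("MIT")
uncached.normalise("MIT")  # resolved from scratch both times
```

With caching disabled every call goes through the full resolution path, which is why `cache=False` is useful when debugging.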

---

## 4. Updating Data Sources

SPDX and OpenDefinition data can be updated via the CLI:

```sh
licence-normaliser update-data --force
```

This fetches fresh JSON from the authoritative upstream URLs and writes them to:
- `src/licence_normaliser/data/spdx/spdx.json`
- `src/licence_normaliser/data/opendefinition/opendefinition.json`

---

## 4a. Trace / Explain

When debugging why a license resolves a certain way, or aligning curated
data sources, use the trace feature:

**Via CLI:**

```sh
licence-normaliser normalise "MIT" --trace
licence-normaliser normalise "CC BY-NC-ND 3.0 igo" --trace
licence-normaliser batch MIT Apache --trace
```

Or via environment variable:
```sh
ENABLE_LICENCE_NORMALISER_TRACE=1 licence-normaliser normalise "MIT"
```

**Via Python:**

```python name=test_trace
from licence_normaliser import normalise_license
v = normalise_license("MIT", trace=True)
print(v.explain())
```

The trace shows:
- Each resolution stage attempted (alias → registry → url → prose → fallback)
- Whether it matched (✓) or didn't (-)
- Source file and line number for curated sources (aliases.json, publishers.json, prose_patterns.json)
- Final result with version_key, name_key, family_key

This is essential for:
- Understanding why a license resolves unexpectedly
- Finding the source line that defines an alias when curating data
- Debugging resolution order issues
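
For intuition, a stage-by-stage trace can be modelled as a list of recorded steps. The `Step` and `Trace` classes below are hypothetical and the output layout is only loosely modelled on the description above; the library's own trace objects may look different.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    stage: str
    matched: bool
    detail: str = ""


@dataclass
class Trace:
    """Hypothetical recorder of the stages tried for one input."""

    raw: str
    steps: list[Step] = field(default_factory=list)

    def record(self, stage: str, matched: bool, detail: str = "") -> None:
        self.steps.append(Step(stage, matched, detail))

    def explain(self) -> str:
        lines = [f"Input: {self.raw!r}"]
        for step in self.steps:
            mark = "✓" if step.matched else "-"
            suffix = f": {step.detail}" if step.detail else ""
            lines.append(f"  [{mark}] {step.stage}{suffix}")
        return "\n".join(lines)


trace = Trace("MIT")
trace.record("alias", False)
trace.record("registry", True, "'mit' found in registry")
report = trace.explain()
```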

---

## 5. Adding a New Parser

Parsers implement plugin interfaces and can be added to `src/licence_normaliser/parsers/`:

1. Create `src/licence_normaliser/parsers/my_parser.py` implementing one or more plugin interfaces:

```python name=test_adding_new_parser
from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin

class MyParser(BasePlugin, RegistryPlugin, URLPlugin):
    url = None  # or upstream URL for refresh
    local_path = "data/my_parser/my_data.json"

    def load_registry(self) -> dict[str, str]:
        # Return {"license_key": "license_key", ...}
        return {}

    def load_urls(self) -> dict[str, str]:
        # Return {"https://...": "license_key", ...}
        return {}
```

2. Register it in `src/licence_normaliser/defaults.py`:

<!-- continue: test_adding_new_parser -->
```python name=test_adding_new_parser_register
from licence_normaliser.parsers.spdx import SPDXParser

def _load_registry_plugins() -> list[type]:
    # ... other imports
    return [
        SPDXParser,
        # ... other plugins
        MyParser,
    ]
```

**Key attribute**: Set `url = None` on parsers that only contribute local data (no refresh capability).

---

## 6. Coding Conventions

- Line length: **88 characters** (ruff)
- Every non-test module must have `__all__`, `__author__`, `__copyright__`, `__license__`
- Always chain exceptions: `raise X(...) from exc`
- Type annotations on all public functions
- Target: `py310`
- Import statements: Avoid imports inside functions/methods unless absolutely
  necessary (e.g., breaking circular dependencies or optional runtime
  dependencies). Lazy imports harm readability and make dependencies unclear.
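
The exception-chaining rule can be illustrated as follows. The `DataFileError` class and `load_json` helper are hypothetical names used only for this sketch.

```python
import json


class DataFileError(Exception):
    """Hypothetical error raised when a data file cannot be parsed."""


def load_json(text: str) -> dict:
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # "from exc" keeps the original error reachable as __cause__.
        raise DataFileError("invalid JSON in data file") from exc


try:
    load_json("{not json")
except DataFileError as err:
    cause = err.__cause__
```

Chaining this way preserves the original traceback, so the underlying parse error stays visible alongside the higher-level one.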

Run linting: `make ruff` or `make pre-commit`

---

## 7. Agent Workflow: Adding Features or Fixing Bugs

1. **Check the mission** - does the change preserve the no-dependencies policy and three-level hierarchy?
2. **Identify the correct location**:
   - New SPDX/OD license → update SPDX/OpenDefinition JSON files (run `update-data`)
   - New alias or family override → add to `data/aliases/aliases.json`
   - **Use `--trace` to find the exact line that defines an alias**
   - New URL mapping → add to `data/publishers/publishers.json`
   - New prose pattern → add to `data/prose/prose_patterns.json`
   - New parser → `parsers/my_parser.py` + `defaults.py`
   - Core pipeline change → `_normaliser.py` or `_cache.py`
3. **Write tests** covering both success and error cases
4. **Update README.rst** if the API changed
5. **Suggest running**: `make test-env ENV=py312` then `make test`
6. **Suggest running**: `make pre-commit`

---

## 8. Testing Rules

> [!NOTE]
> Python 3.15 is being tested on GitHub CI, but not inside a local Docker image.

### Docker-based testing (recommended)

All tests run inside Docker for platform independence and consistency:

```sh
make test                   # full matrix (Python 3.10-3.14)
make test-env ENV=py312     # single version
make shell                  # interactive shell in test container
```

### Local testing (alternative)

For faster iteration during development, you can run tests locally with `uv`:

```sh
make install                # one-time setup
uv run pytest               # run all tests
uv run pytest path/to/test_something.py  # run specific test
```

**Important**: If you encounter tooling errors with local testing, fall back to Docker-based testing, which is the canonical environment.

### Test layout

```text
src/licence_normaliser/tests/
    test_integration.py     - public API only (survives any rewrite)
    test_core.py            - end-to-end pipeline tests
    test_exceptions.py      - exception hierarchy and strict mode
    test_cli.py             - CLI commands including update-data
    test_models.py          - LicenseFamily, LicenseName, LicenseVersion
    test_aliases.py         - non-CC aliases (Apache, MIT, BSD, GPL, etc.)
    test_alias_expansion.py - explicit aliases array expansion feature
    test_publisher.py       - publisher URLs and shorthand aliases
    test_prose.py           - prose pattern matching
```

### Documentation snippet conventions

Code blocks in this file use two special attributes to support chained
executable tests:

- `name=<test_name>` — labels a snippet so it can be referenced later.
- `<!-- continue: <test_name> -->` placed immediately before a code block
  means that block **continues** the named snippet; all names, imports,
  and variables defined in the named block are already in scope and must
  **not** be re-imported or re-declared in the continuation block.

Example:

```python name=test_my_example
class Foo:
    pass
```

<!-- continue: test_my_example -->
```python name=test_my_example_continued
foo = Foo()  # Foo is in scope from the named block above
assert isinstance(foo, Foo)
```

---

## 9. Forbidden

- Adding external dependencies
- Removing existing normalisation coverage
- Changing the three-level hierarchy structure
- Modifying the following files is strictly forbidden:

  - `src/licence_normaliser/data/creativecommons/creativecommons.json`
  - `src/licence_normaliser/data/opendefinition/opendefinition.json`
  - `src/licence_normaliser/data/osi/osi.json`
  - `src/licence_normaliser/data/scancode_licensedb/scancode_licensedb.json`
  - `src/licence_normaliser/data/spdx/spdx.json`

  Use `licence-normaliser update-data --force` to refresh them from upstream
  sources.

conftest.py

"""Pytest fixtures for documentation testing."""

from typing import Any as AnyType

import pytest


@pytest.fixture()
def Any() -> AnyType:  # noqa
    """Expose ``typing.Any`` for use in documentation snippets."""
    return AnyType

docker-compose.yml

services:
  tox:
    build: .
    volumes:
      - ./htmlcov:/app/htmlcov

pyproject.toml

[project]
name = "licence-normaliser"
description = "Comprehensive license normalisation with a three-level hierarchy."
readme = "README.rst"
version = "0.3.2"
requires-python = ">=3.10"
dependencies = []
authors = [
    { name = "Artur Barseghyan", email = "artur.barseghyan@gmail.com" },
]
maintainers = [
    { name = "Artur Barseghyan", email = "artur.barseghyan@gmail.com" },
]
license = "MIT"
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "Operating System :: OS Independent",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "Programming Language :: Python :: 3.14",
    "Programming Language :: Python :: 3.15",
    "Programming Language :: Python",
    "Topic :: Software Development :: Libraries :: Python Modules",
]
keywords = [
    "license",
    "normalisation",
    "spdx",
    "creative commons",
    "open source",
]

[project.scripts]
licence-normaliser = "licence_normaliser.cli:main"

[project.urls]
Homepage = "https://github.com/barseghyanartur/licence-normaliser/"
Repository = "https://github.com/barseghyanartur/licence-normaliser/"
Issues = "https://github.com/barseghyanartur/licence-normaliser/issues"

[project.optional-dependencies]
all = ["licence-normaliser[dev,test,docs,build]"]
dev = [
    "detect-secrets",
    "doc8",
    "ipython",
    "mypy",
    "ruff",
    "uv",
]
test = [
    "pytest",
    "pytest-cov",
    "pytest-codeblock",
]
docs = [
    "sphinx",
    "sphinx-autobuild",
    "sphinx-rtd-theme>=1.3.0",
    "sphinx-no-pragma",
    "sphinx-markdown-builder",
    "sphinx-llms-txt-link",
    "sphinx-source-tree",
]
build = [
    "build",
    "twine",
    "wheel",
]

[tool.setuptools]
package-dir = {"" = "src"}

[tool.setuptools.packages.find]
where = ["src"]
include = ["licence_normaliser", "licence_normaliser.*"]

[tool.setuptools.package-data]
"licence_normaliser" = ["data/**/*.json"]

[build-system]
requires = ["setuptools>=41.0", "wheel"]
build-backend = "setuptools.build_meta"

[tool.ruff]
line-length = 88
lint.select = [
    "B",
    "C4",
    "E",
    "F",
    "G",
    "I",
    "ISC",
    "INP",
    "N",
    "PERF",
    "Q",
    "SIM",
]
lint.ignore = [
    "G004",
    "ISC003",
]
fix = true
src = ["src/licence_normaliser"]
exclude = [
    ".bzr",
    ".direnv",
    ".eggs",
    ".git",
    ".hg",
    ".mypy_cache",
    ".nox",
    ".pants.d",
    ".ruff_cache",
    ".svn",
    ".tox",
    ".venv",
    "__pypackages__",
    "_build",
    "buck-out",
    "build",
    "dist",
    "node_modules",
    "venv",
    "docs",
]
target-version = "py310"
lint.dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"

[tool.ruff.lint.isort]
known-first-party = ["licence_normaliser"]

[tool.ruff.lint.per-file-ignores]
"conftest.py" = [
    "PERF203"
]

[tool.doc8]
ignore-path = [
    "docs/requirements.txt",
    "src/licence_normaliser.egg-info/SOURCES.txt",
]

[tool.pytest.ini_options]
addopts = [
    "-ra",
    "-vvv",
    "-q",
    "--cov=licence_normaliser",
    "--ignore=.tox",
    "--cov-report=html",
    "--cov-report=term",
    "--cov-append",
    "--capture=no",
]
testpaths = [
    "src/licence_normaliser/tests",
    ".",
    "**/*.rst",
    "**/*.md",
]
pythonpath = ["src"]
norecursedirs = [".git", ".tox"]

[tool.coverage.run]
relative_files = true
omit = [".tox/*"]
source = ["licence_normaliser"]

[tool.coverage.report]
show_missing = true
exclude_lines = [
    "pragma: no cover",
    "@overload",
]

[tool.mypy]
check_untyped_defs = true
warn_unused_ignores = true
warn_redundant_casts = true
warn_unused_configs = true
ignore_missing_imports = true

[tool.sphinx-source-tree]
ignore = [
    "*.egg-info",
    "*.py,cover",
    "*.pyc",
    "*.pyo",
    ".DS_Store",
    ".coverage",
    ".coverage.*",
    ".git",
    ".hg",
    ".hypothesis",
    ".idea",
    ".mypy_cache",
    ".nox",
    ".pre-commit-config.yaml",
    ".pre-commit-hooks.yaml",
    ".pytest_cache",
    ".readthedocs.yaml",
    ".ruff_cache",
    ".secrets.baseline",
    ".svn",
    ".tox",
    ".venv",
    ".vscode",
    "CHANGELOG.rst",
    "CODE_OF_CONDUCT.rst",
    "LICENSE",
    "SECURITY.rst",
    "Thumbs.db",
    "__pycache__",
    "build",
    "codebin",
    "dist",
    "docs/Makefile",
    "docs/_build",
    "docs/_static",
    "docs/changelog.rst",
    "docs/code_of_conduct.rst",
    "docs/make.bat",
    "docs/requirements.txt",
    "docs/security.rst",
    "docs/source_tree.rst",
    "docs/source_tree_full.rst",
    "env",
    "htmlcov",
    "node_modules",
    "venv",
    "ARCHITECTURE.rst",
    ".coderabbit.yaml",
    ".coveralls",
    "docs/full-llms.rst",
    "docs/llms.rst",
    "docs/contributor_guidelines.rst",
    "docs/package.rst",
    "docs/documentation.rst",
    "docs/index.rst",
    "uv.lock",
    "codebin",
    "src/licence_normaliser/data/creativecommons",
    "src/licence_normaliser/data/opendefinition",
    "src/licence_normaliser/data/osi",
    "src/licence_normaliser/data/scancode_licensedb",
    "src/licence_normaliser/data/spdx",
]
order = [
    "README.rst",
    "CONTRIBUTING.rst",
    "AGENTS.md",
]

[[tool.sphinx-source-tree.files]]
output = "docs/full_llms.rst"
title = "Full project source-tree"

[[tool.sphinx-source-tree.files]]
output = "docs/llms.rst"
title = "Project source-tree"
ignore = [
    "*.egg-info",
    "*.py,cover",
    "*.pyc",
    "*.pyo",
    ".DS_Store",
    ".coverage",
    ".coverage.*",
    ".git",
    ".hg",
    ".hypothesis",
    ".idea",
    ".mypy_cache",
    ".nox",
    ".pre-commit-config.yaml",
    ".pre-commit-hooks.yaml",
    ".pytest_cache",
    ".readthedocs.yaml",
    ".ruff_cache",
    ".secrets.baseline",
    ".svn",
    ".tox",
    ".venv",
    ".vscode",
    "CHANGELOG.rst",
    "CODE_OF_CONDUCT.rst",
    "LICENSE",
    "SECURITY.rst",
    "Thumbs.db",
    "__pycache__",
    "build",
    "codebin",
    "dist",
    "docs/Makefile",
    "docs/_build",
    "docs/_static",
    "docs/changelog.rst",
    "docs/code_of_conduct.rst",
    "docs/make.bat",
    "docs/requirements.txt",
    "docs/security.rst",
    "docs/source_tree.rst",
    "docs/source_tree_full.rst",
    "env",
    "htmlcov",
    "node_modules",
    "venv",
    "examples",
    "docs",
    "ARCHITECTURE.rst",
    ".coderabbit.yaml",
    ".coveralls",
    "docs/full-llms.rst",
    "docs/llms.rst",
    "docs/contributor_guidelines.rst",
    "docs/package.rst",
    "docs/documentation.rst",
    "docs/index.rst",
    "uv.lock",
    "src/licence_normaliser/data/creativecommons",
    "src/licence_normaliser/data/opendefinition",
    "src/licence_normaliser/data/osi",
    "src/licence_normaliser/data/scancode_licensedb",
    "src/licence_normaliser/data/spdx",
]

scripts/README.rst

Scripts
=======

Sort aliases
------------

Sorts ``aliases.json`` keys alphabetically. Comment keys (starting with
``_``) are preserved at the top in their original order. All other entries
are sorted case-insensitively.

.. code-block:: sh

    uv run python scripts/sort_aliases.py
    uv run python scripts/sort_aliases.py --check  # exit 1 if not sorted
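
The ordering rule described above can be sketched as follows. This is a
minimal illustration, not the script's actual code: the function name
``sort_alias_keys`` and the sample data are assumptions made for the example.

.. code-block:: python

    def sort_alias_keys(data: dict) -> dict:
        """Return a copy with ``_`` keys first, then the rest sorted."""
        # Comment keys keep their original relative order.
        comments = {k: v for k, v in data.items() if k.startswith("_")}
        # All other keys are sorted case-insensitively.
        entries = sorted(
            ((k, v) for k, v in data.items() if not k.startswith("_")),
            key=lambda item: item[0].casefold(),
        )
        return {**comments, **dict(entries)}


    # Hypothetical sample, not taken from the real aliases.json.
    sample = {"MIT": {}, "_comment": "curated", "apache-2.0": {}, "_note": ""}
    print(list(sort_alias_keys(sample)))

Because Python dictionaries preserve insertion order, rebuilding the mapping
in the desired order is enough for ``json.dump`` to write the keys sorted.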

Find alias duplicates
---------------------

Finds duplicate ``version_key`` entries in ``aliases.json``. A "duplicate"
is when two or more top-level primary keys share the same ``version_key``.
Reports groups with more than one member.

Can optionally fix duplicates by merging them into the ``aliases`` list of
a single canonical entry.

.. code-block:: sh

    uv run python scripts/find_alias_duplicates.py
    uv run python scripts/find_alias_duplicates.py --fix      # interactive fix
    uv run python scripts/find_alias_duplicates.py --noinput  # auto-apply safe fixes
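
The duplicate check amounts to grouping top-level keys by their
``version_key`` and keeping only groups with more than one member. A minimal
sketch, assuming each entry is a dict with a ``version_key`` field; the
function name and sample data are hypothetical and do not come from the
script itself:

.. code-block:: python

    from collections import defaultdict


    def find_duplicate_version_keys(data: dict) -> dict[str, list[str]]:
        """Map each shared version_key to the primary keys that use it."""
        groups: dict[str, list[str]] = defaultdict(list)
        for key, meta in data.items():
            # Skip comment keys and anything that is not an entry dict.
            if key.startswith("_") or not isinstance(meta, dict):
                continue
            version_key = meta.get("version_key", "")
            if version_key:
                groups[version_key].append(key)
        # Only groups with more than one member count as duplicates.
        return {vk: keys for vk, keys in groups.items() if len(keys) > 1}


    # Hypothetical sample, not taken from the real aliases.json.
    sample = {
        "_comment": "curated",
        "gpl-3": {"version_key": "gpl-3.0"},
        "gplv3": {"version_key": "gpl-3.0"},
        "mit": {"version_key": "mit"},
    }
    print(find_duplicate_version_keys(sample))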

Apply aliases patch
-------------------

Applies curated additions to ``aliases.json``. Adds an ``aliases`` list to
existing CC version-free entries and adds new top-level entries for GPL
shorthand keys that currently fall through to the unknown fallback.

.. code-block:: sh

    uv run python scripts/apply_aliases_patch.py

Compare datasets
----------------

Compares SPDX, OpenDefinition, OSI, CreativeCommons, ScanCode, and curated
data files (aliases, url_map, prose, publishers).

.. code-block:: sh

    uv run python scripts/compare_datasets.py

Check missing aliases
---------------------

Checks which licenses downloaded from the internet (via refreshable plugins)
have corresponding entries in the curated ``aliases.json`` file.

.. code-block:: sh

    uv run python scripts/check_missing_aliases.py
    uv run python scripts/check_missing_aliases.py --json  # JSON output
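
The coverage figure is plain set arithmetic over the downloaded license keys
and the curated ``version_key`` targets. The sketch below mirrors that
calculation on made-up data; the function name ``coverage`` and the sample
sets are assumptions for illustration only.

.. code-block:: python

    def coverage(downloaded: set[str], alias_targets: set[str]) -> dict:
        """Summarise which downloaded keys have a curated alias target."""
        with_alias = downloaded & alias_targets
        without_alias = downloaded - alias_targets
        # Guard against division by zero when nothing was downloaded.
        percent = (
            round(len(with_alias) / len(downloaded) * 100, 1) if downloaded else 0
        )
        return {
            "with_alias": sorted(with_alias),
            "without_alias": sorted(without_alias),
            "coverage_percent": percent,
        }


    # Hypothetical sample data.
    report = coverage({"mit", "gpl-3.0", "isc"}, {"mit", "isc", "cc-by-4.0"})
    print(report["without_alias"], report["coverage_percent"])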


Test name inference
-------------------

Assesses the accuracy of heuristic name stripping against curated
``name_key`` values from ``aliases.json``. Shows how well automatic name
extraction works for different license families (CC, copyleft, OSI, etc.).

.. code-block:: sh

    uv run python scripts/test_name_inference.py
    uv run python scripts/test_name_inference.py --json  # JSON summary
    uv run python scripts/test_name_inference.py --json --details  # per-license details
    uv run python scripts/test_name_inference.py --json --incorrect-only  # mismatches only
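
The accuracy measure is the share of curated ``version_key`` to ``name_key``
pairs that the heuristic reproduces. The sketch below shows the idea with a
deliberately simple stand-in heuristic; ``strip_version`` is hypothetical
and is not the library's inference method, and the sample mappings are
invented for the example.

.. code-block:: python

    import re


    def strip_version(version_key: str) -> str:
        """Hypothetical heuristic: drop a trailing ``-<number>`` suffix."""
        return re.sub(r"-\d+(\.\d+)*$", "", version_key)


    def accuracy(mappings: dict[str, str]) -> float:
        """Percentage of version keys whose inferred name matches."""
        if not mappings:
            return 0.0
        correct = sum(strip_version(vk) == nk for vk, nk in mappings.items())
        return round(correct / len(mappings) * 100, 1)


    # Invented sample mappings (version_key -> curated name_key).
    curated = {"cc-by-4.0": "cc-by", "gpl-3.0": "gpl", "gpl-3.0-only": "gpl"}
    print(accuracy(curated))

In this sample the suffix ``-only`` defeats the stand-in heuristic, which is
the kind of mismatch the per-family breakdown is meant to surface.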

scripts/__init__.py

scripts/check_missing_aliases.py

"""Check which downloaded licenses are missing from curated aliases.

Compares all refreshable plugin registries against aliases.json to identify
licenses that have no corresponding curated alias entry.

Usage:
    uv run python scripts/check_missing_aliases.py
    uv run python scripts/check_missing_aliases.py --json
"""

from __future__ import annotations

import contextlib
import json
import sys
from pathlib import Path

DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data"
SCRIPTS_DIR = Path(__file__).parent


def load_alias_targets() -> set[str]:
    """Load all version_keys from aliases.json."""
    with open(DATA_DIR / "aliases" / "aliases.json") as f:
        data = json.load(f)
    targets: set[str] = set()
    for meta in data.values():
        if isinstance(meta, dict):
            vk = meta.get("version_key", "")
            if vk:
                targets.add(vk)
    return targets


def load_downloaded_licenses() -> dict[str, set[str]]:
    """Load licenses from all refreshable plugins."""
    from licence_normaliser.defaults import get_all_refreshable_plugins

    result: dict[str, set[str]] = {}
    for plugin_cls in get_all_refreshable_plugins():
        # Try to load registry
        data = None
        with contextlib.suppress(Exception):
            data = plugin_cls().load_registry()

        if data:
            result[plugin_cls.__name__] = set(data.keys())

    return result


def check_coverage() -> dict:
    """Check which downloaded licenses have alias entries."""
    alias_targets = load_alias_targets()
    downloaded = load_downloaded_licenses()

    all_downloaded: set[str] = set()
    for licenses in downloaded.values():
        all_downloaded.update(licenses)

    # Categorize
    with_alias = all_downloaded & alias_targets
    without_alias = all_downloaded - alias_targets

    return {
        "total_downloaded": len(all_downloaded),
        "total_alias_targets": len(alias_targets),
        "with_alias": sorted(with_alias),
        "without_alias": sorted(without_alias),
        "coverage_percent": round(len(with_alias) / len(all_downloaded) * 100, 1)
        if all_downloaded
        else 0,
        "by_source": {
            name: {
                "total": len(licenses),
                "with_alias": len(licenses & alias_targets),
                "without_alias": sorted(licenses - alias_targets),
                "coverage": round(
                    len(licenses & alias_targets) / len(licenses) * 100, 1
                )
                if licenses
                else 0,
            }
            for name, licenses in downloaded.items()
        },
    }


def group_by_prefix(licenses: list[str]) -> dict[str, list[str]]:
    """Group licenses by common prefixes."""
    groups: dict[str, list[str]] = {}
    prefixes = [
        "gpl-",
        "agpl-",
        "lgpl-",
        "apache-",
        "mpl-",
        "mit",
        "bsd",
        "cc-",
        "unlicense",
        "zlib",
        "isc",
    ]
    for prefix in prefixes:
        matches = sorted([lic for lic in licenses if lic.startswith(prefix)])
        if matches:
            # Use the prefix without its trailing dash as the group name
            groups[prefix.rstrip("-")] = matches
            licenses = [lic for lic in licenses if not lic.startswith(prefix)]

    if licenses:
        groups["other"] = sorted(licenses)

    return groups


def print_report(data: dict) -> None:
    """Print text table report."""
    print("=" * 70)
    print("Coverage Report: Downloaded Licenses vs Curated Aliases")
    print("=" * 70)
    print()
    print(f"Total downloaded: {data['total_downloaded']}")
    print(f"Total alias targets: {data['total_alias_targets']}")
    print(f"Coverage: {data['coverage_percent']}%")
    print()

    print("-" * 70)
    print("By Source:")
    print("-" * 70)
    print(f"{'Source':<30} {'Total':>8} {'With':>8} {'Without':>8} {'Coverage':>10}")
    print("-" * 70)

    for source, stats in data["by_source"].items():
        print(
            f"{source:<30} {stats['total']:>8} "
            f"{stats['with_alias']:>8} {len(stats['without_alias']):>8} "
            f"{stats['coverage']:>9.1f}%"
        )

    print()
    print("=" * 70)
    print(f"Missing Aliases ({len(data['without_alias'])} licenses)")
    print("=" * 70)

    groups = group_by_prefix(data["without_alias"].copy())

    for group_name, licenses in groups.items():
        if group_name == "other":
            print()
            print(f"All other licenses ({len(licenses)}):")
        else:
            print()
            print(f"{group_name.upper()} ({len(licenses)}):")

        for lic in licenses:
            print(f"  {lic}")

    print()


def main() -> None:
    json_export = "--json" in sys.argv
    data = check_coverage()

    if json_export:
        print(json.dumps(data, indent=2))
    else:
        print_report(data)


if __name__ == "__main__":
    main()

scripts/compare_datasets.py

"""Dataset comparison tool for licence-normaliser.

Compares SPDX, OpenDefinition, OSI, CreativeCommons, ScanCode, and
curated data files (aliases, url_map, prose, publishers) for:
  - Dataset sizes
  - Cross-dataset overlaps
  - Licenses present in OSI but missing from SPDX
  - Orphan alias/URL targets (don't resolve to REGISTRY entries)
  - REGISTRY entries without curated aliases
  - Most-aliased license targets
"""

from __future__ import annotations

__all__ = ()

import json
from collections import Counter
from pathlib import Path

DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data"


def load_spdx_ids() -> set[str]:
    with open(DATA_DIR / "spdx" / "spdx.json") as f:
        data = json.load(f)
    return {entry["licenseId"] for entry in data["licenses"]}


def load_od_ids() -> set[str]:
    with open(DATA_DIR / "opendefinition" / "opendefinition.json") as f:
        data = json.load(f)
    return set(data.keys())


def load_osi_ids() -> set[str]:
    with open(DATA_DIR / "osi" / "osi.json") as f:
        data = json.load(f)
    return {entry["spdx_id"].strip() for entry in data if entry.get("spdx_id")}


def load_cc_ids() -> set[str]:
    with open(DATA_DIR / "creativecommons" / "creativecommons.json") as f:
        data = json.load(f)
    return {entry["license_key"] for entry in data}


def load_sc_ids() -> set[str]:
    with open(DATA_DIR / "scancode_licensedb" / "scancode_licensedb.json") as f:
        data = json.load(f)
    return {entry["license_key"] for entry in data}


def load_alias_keys() -> set[str]:
    with open(DATA_DIR / "aliases" / "aliases.json") as f:
        data = json.load(f)
    return {k for k in data if not k.startswith("_")}


def load_alias_targets() -> dict[str, str]:
    with open(DATA_DIR / "aliases" / "aliases.json") as f:
        data = json.load(f)
    return {
        k: v.get("version_key", "") for k, v in data.items() if not k.startswith("_")
    }


def load_url_keys() -> set[str]:
    with open(DATA_DIR / "urls" / "url_map.json") as f:
        data = json.load(f)
    return {k for k in data if not k.startswith("_")}


def load_url_targets() -> dict[str, str]:
    with open(DATA_DIR / "urls" / "url_map.json") as f:
        data = json.load(f)
    return {
        k: v.get("version_key", "") for k, v in data.items() if not k.startswith("_")
    }


def load_prose_targets() -> list[str]:
    with open(DATA_DIR / "prose" / "prose_patterns.json") as f:
        data = json.load(f)
    return [entry.get("version_key", "") for entry in data]


def load_pub_urls() -> set[str]:
    with open(DATA_DIR / "publishers" / "publishers.json") as f:
        data = json.load(f)
    return set(data.get("urls", {}).keys())


def load_pub_aliases() -> dict[str, str]:
    with open(DATA_DIR / "publishers" / "publishers.json") as f:
        data = json.load(f)
    return dict(data.get("shorthand_aliases", {}))


def load_registry_keys() -> set[str]:
    from licence_normaliser._cache import get_registry_keys

    return get_registry_keys()


def load_merged_aliases() -> dict[str, str]:
    """Simulate merged ALIASES: alias_key -> version_key from all curated sources."""
    merged: dict[str, str] = {}
    merged.update(load_alias_targets())
    merged.update(load_pub_aliases())
    for k, v in load_url_targets().items():
        if k not in merged:
            merged[k] = v
    return merged


def would_resolve(alias_key: str, registry: set[str], aliases: dict[str, str]) -> bool:
    """Simulate _resolve() pipeline for orphan detection.

    1. If already in REGISTRY, covered.
    2. If in ALIASES, get version_key - resolves regardless of registry presence.
    """
    if alias_key in registry:
        return True
    version_key = aliases.get(alias_key, "")
    return bool(version_key)


def section(title: str) -> None:
    print(f"\n{'=' * 60}")
    print(f"  {title}")
    print(f"{'=' * 60}")


def main() -> None:
    print("Loading datasets...")
    spdx = load_spdx_ids()
    od = load_od_ids()
    osi = load_osi_ids()
    cc = load_cc_ids()
    sc = load_sc_ids()
    alias_keys = load_alias_keys()
    alias_tgt = load_alias_targets()
    url_keys = load_url_keys()
    url_tgt = load_url_targets()
    prose_tgt = load_prose_targets()
    pub_urls = load_pub_urls()
    pub_aliases = load_pub_aliases()
    registry = load_registry_keys()
    merged_aliases = load_merged_aliases()

    # --- 1. Dataset sizes ---
    section("Dataset Sizes")
    print(f"  SPDX licenses:          {len(spdx):>6}")
    print(f"  OpenDefinition entries: {len(od):>6}")
    print(f"  OSI-approved (SPDX):   {len(osi):>6}")
    print(f"  CreativeCommons:        {len(cc):>6}")
    print(f"  ScanCode DB entries:   {len(sc):>6}")
    print(f"  Aliases (curated):     {len(alias_keys):>6}")
    print(f"  URL mappings (curated): {len(url_keys):>6}")
    print(f"  Prose patterns:        {len(prose_tgt):>6}")
    print(f"  Publisher URLs:        {len(pub_urls):>6}")
    print(f"  Publisher aliases:     {len(pub_aliases):>6}")
    print(f"  REGISTRY entries:     {len(registry):>6}")

    # --- 2. Overlaps ---
    section("Cross-Dataset Overlaps")

    # Format an overlap count as a percentage of the given total
    def pct(sub: int, total: int) -> str:
        return f"{100 * sub / max(total, 1):.1f}%"

    overlaps = [
        ("SPDX n OSI", len(spdx & osi), len(osi), "OSI"),
        ("SPDX n OD", len(spdx & od), len(od), "OD"),
        ("SPDX n CC", len(spdx & cc), len(cc), "CC"),
        ("OSI n OD", len(osi & od), len(od), "OD"),
        ("OSI n CC", len(osi & cc), len(cc), "CC"),
        ("OD  n CC", len(od & cc), len(cc), "CC"),
        ("ScanCode n SPDX", len(sc & spdx), len(sc), "ScanCode"),
        ("ScanCode n OSI", len(sc & osi), len(sc), "ScanCode"),
    ]
    for label, overlap_count, total_count, pct_label in overlaps:
        ratio = pct(overlap_count, total_count)
        print(f"  {label:<17} {overlap_count:>5}  ({ratio} of {pct_label})")

    # Unique content
    print(f"\n  Unique to SPDX:  {len(spdx - od - osi - cc - sc):>6}")
    print(f"  Unique to OD:    {len(od - spdx):>6}")
    print(f"  Unique to OSI:   {len(osi - spdx):>6}  (OSI IDs not in SPDX)")
    print(f"  Unique to CC:    {len(cc - spdx - od):>6}")
    print(f"  Unique to ScanCode: {len(sc - spdx - osi - od - cc):>6}")

    # --- 3. OSI licenses not in SPDX (reference integrity) ---
    section("OSI Licenses Missing from SPDX")
    osi_only = sorted(osi - spdx)
    if osi_only:
        print(f"  {len(osi_only)} OSI-licensed IDs have no SPDX entry:")
        for lid in osi_only[:20]:
            print(f"    {lid}")
        if len(osi_only) > 20:
            print(f"    ... and {len(osi_only) - 20} more")
    else:
        print("  All OSI IDs are present in SPDX.")

    # --- 4. Curated targets not in REGISTRY ---
    section("Curated Targets Missing from REGISTRY")
    orphan_alias = sorted(
        k for k in alias_keys if not would_resolve(k, registry, merged_aliases)
    )
    orphan_url = sorted(
        k for k in url_keys if not would_resolve(k, registry, merged_aliases)
    )
    orphan_pub = sorted(
        k for k in pub_aliases if not would_resolve(k, registry, merged_aliases)
    )
    if orphan_alias:
        print(f"  Alias keys that fail resolution ({len(orphan_alias)}):")
        for k in orphan_alias[:10]:
            print(f"    {k!r}  ->  {alias_tgt.get(k, '')!r}")
        if len(orphan_alias) > 10:
            print(f"    ... and {len(orphan_alias) - 10} more")
    else:
        print("  All alias keys resolve to REGISTRY entries.")
    if orphan_url:
        print(f"\n  URL keys that fail resolution ({len(orphan_url)}):")
        for k in orphan_url[:10]:
            print(f"    {k[:60]!r}  ->  {url_tgt.get(k, '')!r}")
        if len(orphan_url) > 10:
            print(f"    ... and {len(orphan_url) - 10} more")
    if orphan_pub:
        print(f"\n  Publisher aliases that fail resolution ({len(orphan_pub)}):")
        for k in orphan_pub[:10]:
            print(f"    {k!r}  ->  {pub_aliases[k]!r}")
        if len(orphan_pub) > 10:
            print(f"    ... and {len(orphan_pub) - 10} more")
    print(
        "\n  (Note: prose pattern version_keys are often bare name_keys like "
        "'cc-by'; these resolve via the prose pipeline and are not orphans.)"
    )

    # --- 5. REGISTRY entries not covered by curated data ---
    section("REGISTRY Entries Without Curated Mapping")
    covered = (
        set(alias_tgt.values()) | set(url_tgt.values()) | set(pub_aliases.values())
    )
    uncovered = sorted(k for k in registry if k not in covered)
    if uncovered:
        print(f"  {len(uncovered)} REGISTRY keys have no curated alias/URL mapping:")
        for k in uncovered[:20]:
            print(f"    {k}")
        if len(uncovered) > 20:
            print(f"    ... and {len(uncovered) - 20} more")
    else:
        print("  All REGISTRY entries have at least one curated mapping.")

    # --- 6. Duplicate alias keys (same key -> different targets) ---
    section("Duplicate Keys in Alias / URL Data Files")
    # Check if any key maps to different targets across aliases + url_map
    # (keys are unique within each file, so cross-file check)
    cross_keys = alias_keys & url_keys
    if cross_keys:
        print(f"  Keys in both aliases.json AND url_map.json ({len(cross_keys)}):")
        for k in sorted(cross_keys):
            print(f"    {k!r}: aliases={alias_tgt[k]!r}, url_map={url_tgt[k]!r}")

    # --- 7. Alias target frequency (which targets have the most aliases) ---
    section("Most-Aliased License Targets")
    alias_counts = Counter(alias_tgt.values())
    url_counts = Counter(url_tgt.values())
    pub_counts = Counter(pub_aliases.values())
    combined = alias_counts + url_counts + pub_counts
    for target, count in combined.most_common(15):
        parts = []
        if alias_counts[target]:
            parts.append(f"alias={alias_counts[target]}")
        if url_counts[target]:
            parts.append(f"url={url_counts[target]}")
        if pub_counts[target]:
            parts.append(f"pub={pub_counts[target]}")
        print(f"  {target:<30}  total={count:<4}  ({', '.join(parts)})")

    # --- 8. Summary ---
    section("Summary")
    distinct = len(spdx | od | osi | cc | sc)
    orphans = len(orphan_alias) + len(orphan_url) + len(orphan_pub)
    print(f"  Distinct license IDs:          {distinct}")
    print(f"  Curated alias entries:        {len(alias_keys)}")
    print(f"  Curated URL mappings:         {len(url_keys)}")
    print(f"  Orphan curated targets:       {orphans}")
    print(f"  OSI IDs missing SPDX:         {len(osi_only)}")
    covered_count = len(registry) - len(uncovered)
    print(f"  REGISTRY entries covered:       {covered_count}/{len(registry)}")


if __name__ == "__main__":
    main()

scripts/test_name_inference.py

"""Test name inference accuracy against curated aliases.

Compares heuristic name stripping against curated name_key values from
aliases.json to assess how well automatic name extraction works.

Usage:
    uv run python scripts/test_name_inference.py
    uv run python scripts/test_name_inference.py --json
    uv run python scripts/test_name_inference.py --json --incorrect-only
    uv run python scripts/test_name_inference.py --json --details
"""

from __future__ import annotations

import json
import sys
from pathlib import Path

from licence_normaliser import LicenseNormaliser

DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data"
SCRIPTS_DIR = Path(__file__).parent

_normaliser = LicenseNormaliser()


def load_name_mappings() -> dict[str, str]:
    """Load version_key -> name_key mappings from aliases.json."""
    with open(DATA_DIR / "aliases" / "aliases.json") as f:
        data = json.load(f)
    mappings: dict[str, str] = {}
    for meta in data.values():
        if isinstance(meta, dict):
            vk = meta.get("version_key", "")
            nk = meta.get("name_key", "")
            if vk and nk:
                mappings[vk] = nk
    return mappings


def infer_name_heuristic(version_key: str) -> str:
    """Delegate to the core LicenseNormaliser's _infer_name method."""
    return _normaliser._infer_name(version_key)


def categorize_by_family(mappings: dict[str, str]) -> dict[str, dict[str, str]]:
    """Categorize licenses by inferred family."""
    categories: dict[str, dict[str, str]] = {
        "cc": {},  # Creative Commons
        "copyleft": {},  # GPL/AGPL/LGPL
        "osi": {},  # OSI-approved
        "other": {},
    }

    for vk, nk in mappings.items():
        if vk.startswith("cc-"):
            categories["cc"][vk] = nk
        elif vk.startswith(("gpl-", "agpl-", "lgpl-")):
            categories["copyleft"][vk] = nk
        elif vk.startswith(
            ("mpl-", "apache-", "bsd-", "mit", "isc", "unlicense", "zlib")
        ):
            categories["osi"][vk] = nk
        else:
            categories["other"][vk] = nk

    return categories


def assess_accuracy() -> dict:
    """Assess name inference accuracy."""
    mappings = load_name_mappings()
    categories = categorize_by_family(mappings)

    results: dict = {
        "total_mappings": len(mappings),
        "by_family": {},
    }

    for family, family_mappings in categories.items():
        correct = 0
        incorrect = 0
        details: list[dict] = []

        for vk, curated_nk in family_mappings.items():
            inferred = infer_name_heuristic(vk)
            is_match = inferred == curated_nk
            if is_match:
                correct += 1
            else:
                incorrect += 1

            details.append(
                {
                    "version_key": vk,
                    "curated_name": curated_nk,
                    "inferred_name": inferred,
                    "match": is_match,
                }
            )

        accuracy = (
            round(correct / len(family_mappings) * 100, 1) if family_mappings else 0
        )

        results["by_family"][family] = {
            "total": len(family_mappings),
            "correct": correct,
            "incorrect": incorrect,
            "accuracy_percent": accuracy,
            "details": details,
        }

    # Overall accuracy
    all_correct = sum(r["correct"] for r in results["by_family"].values())
    all_total = sum(r["total"] for r in results["by_family"].values())
    results["overall_accuracy"] = (
        round(all_correct / all_total * 100, 1) if all_total else 0
    )

    return results


def print_report(data: dict) -> None:
    """Print text table report."""
    print("=" * 70)
    print("Name Inference Accuracy Report")
    print("=" * 70)
    print()
    print(f"Total curated mappings: {data['total_mappings']}")
    print(f"Overall accuracy: {data['overall_accuracy']}%")
    print()

    print("-" * 70)
    print("By Family:")
    print("-" * 70)
    print(
        f"{'Family':<15} {'Total':>8} {'Correct':>8} {'Incorrect':>8} {'Accuracy':>10}"
    )
    print("-" * 70)

    for family, stats in data["by_family"].items():
        print(
            f"{family:<15} {stats['total']:>8} {stats['correct']:>8} "
            f"{stats['incorrect']:>8} {stats['accuracy_percent']:>9.1f}%"
        )

    print()

    # Show some incorrect examples
    for family, stats in data["by_family"].items():
        if stats["incorrect"] > 0:
            print("-" * 70)
            print(f"Incorrect in {family}: {stats['incorrect']} cases")
            print("-" * 70)
            print(
                f"{'Version Key':<30} {'Curated (aliases.json)':<25} "
                f"{'Inferred (heuristic)':<20}"
            )
            print("-" * 70)
            # Filter first, then slice, so up to 10 mismatches are shown
            incorrect_details = [d for d in stats["details"] if not d["match"]]
            for detail in incorrect_details[:10]:
                print(
                    f"{detail['version_key']:<30} "
                    f"{detail['curated_name']:<25} {detail['inferred_name']:<20}"
                )
            if len(incorrect_details) > 10:
                print(f"... and {len(incorrect_details) - 10} more")
            print()


def main() -> None:
    json_export = "--json" in sys.argv
    incorrect_only = "--incorrect-only" in sys.argv
    include_details = "--details" in sys.argv
    data = assess_accuracy()

    if json_export:
        for family in data["by_family"]:
            details = data["by_family"][family].get("details", [])
            if incorrect_only:
                data["by_family"][family]["details"] = [
                    d for d in details if not d["match"]
                ]
            elif not include_details:
                data["by_family"][family].pop("details", None)
        print(json.dumps(data, indent=2))
    else:
        print_report(data)


if __name__ == "__main__":
    main()

src/licence_normaliser/__init__.py

"""licence_normaliser - License normalisation with a three-level hierarchy."""

from ._core import (
    LicenseFamily,
    LicenseName,
    LicenseVersion,
    normalise_license,
    normalise_licenses,
)
from ._normaliser import LicenseNormaliser
from ._trace import LicenseTrace, LicenseTraceStage
from .exceptions import LicenseNormalisationError, LicenseNotFoundError

__title__ = "licence-normaliser"
__version__ = "0.3.2"
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"

__all__ = (
    "LicenseFamily",
    "LicenseName",
    "LicenseVersion",
    "LicenseNormaliser",
    "LicenseNormalisationError",
    "LicenseNotFoundError",
    "LicenseTrace",
    "LicenseTraceStage",
    "normalise_license",
    "normalise_licenses",
)

src/licence_normaliser/_cache.py

"""Caching layer + strict mode - delegates to LicenseNormaliser with defaults."""

from __future__ import annotations

from threading import Lock
from typing import Iterable

from ._models import LicenseVersion
from ._normaliser import LicenseNormaliser

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "_default",
    "get_registry_keys",
    "normalise_license",
    "normalise_licenses",
)


class _DefaultNormaliser:
    """Thread-safe lazy singleton for the default LicenseNormaliser instance."""

    _instance: LicenseNormaliser | None = None
    _lock: Lock = Lock()

    def get(self) -> LicenseNormaliser:
        if _DefaultNormaliser._instance is None:
            with _DefaultNormaliser._lock:
                if _DefaultNormaliser._instance is None:
                    _DefaultNormaliser._instance = LicenseNormaliser()
        return _DefaultNormaliser._instance


_default = _DefaultNormaliser()


def normalise_license(
    raw: str, *, strict: bool = False, trace: bool | None = None
) -> LicenseVersion:
    """Public API with optional strict mode and trace."""
    return _default.get().normalise_license(raw, strict=strict, trace=trace)


def normalise_licenses(
    raws: Iterable[str], *, strict: bool = False, trace: bool | None = None
) -> list[LicenseVersion]:
    """Batch version with optional trace."""
    return _default.get().normalise_licenses(raws, strict=strict, trace=trace)


def get_registry_keys() -> set[str]:
    """Return the set of all known registry keys from the runtime normaliser."""
    return _default.get().registry_keys()

src/licence_normaliser/_core.py

"""License Normaliser - public orchestration shim."""

from __future__ import annotations

from ._cache import normalise_license, normalise_licenses
from ._models import LicenseFamily, LicenseName, LicenseVersion

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "LicenseFamily",
    "LicenseName",
    "LicenseVersion",
    "normalise_license",
    "normalise_licenses",
)

src/licence_normaliser/_models.py

"""License data models - frozen dataclasses for the three-level hierarchy."""

from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "LicenseFamily",
    "LicenseName",
    "LicenseVersion",
)


@dataclass(frozen=True, slots=True)
class LicenseFamily:
    key: str

    def __str__(self) -> str:
        return self.key

    def __repr__(self) -> str:
        return f"LicenseFamily({self.key!r})"

    def __eq__(self, other: object) -> bool:
        if isinstance(other, LicenseFamily):
            return self.key == other.key
        if isinstance(other, str):
            return self.key == other
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.key)


@dataclass(frozen=True, slots=True)
class LicenseName:
    key: str
    family: LicenseFamily

    def __str__(self) -> str:
        return self.key

    def __repr__(self) -> str:
        return f"LicenseName({self.key!r}, family={self.family.key!r})"

    def __eq__(self, other: object) -> bool:
        if isinstance(other, LicenseName):
            return self.key == other.key
        if isinstance(other, str):
            return self.key == other
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.key)


@dataclass(frozen=True, slots=True)
class LicenseVersion:
    key: str
    url: Optional[str]
    license: LicenseName
    _trace: Optional[object] = field(default=None, repr=False)

    @property
    def family(self) -> LicenseFamily:
        return self.license.family

    def __str__(self) -> str:
        return self.key

    def __repr__(self) -> str:
        return (
            f"LicenseVersion(key={self.key!r}, "
            f"license={self.license.key!r}, "
            f"family={self.license.family.key!r})"
        )

    def __eq__(self, other: object) -> bool:
        if isinstance(other, LicenseVersion):
            return self.key == other.key
        if isinstance(other, str):
            return self.key == other
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.key)

    def explain(self) -> str:
        """Return explanation of how this license was resolved.

        Set ENABLE_LICENCE_NORMALISER_TRACE=1 to enable tracing,
        or pass trace=True to normalise_license().
        """
        if self._trace is not None:
            return str(self._trace)

        from licence_normaliser._cache import _default
        from licence_normaliser._trace import _should_trace

        if not _should_trace():
            return "Trace disabled. Set ENABLE_LICENCE_NORMALISER_TRACE=1 to enable."

        ln = _default.get()
        cleaned = ln._clean(ln._try_decode_mojibake(self.key))
        result = ln._resolve_with_trace(self.key, cleaned, strict=False)
        trace = result._trace
        return str(trace) if trace else "No trace available."

src/licence_normaliser/_normaliser.py

"""Plugin-based LicenseNormaliser class with configurable constructor injection."""

from __future__ import annotations

import re
from functools import lru_cache
from typing import TYPE_CHECKING, Iterable, Sequence

from licence_normaliser.defaults import (
    get_default_alias,
    get_default_family,
    get_default_name,
    get_default_prose,
    get_default_registry,
    get_default_url,
)

if TYPE_CHECKING:
    from licence_normaliser._models import LicenseVersion
    from licence_normaliser._trace import LicenseTrace

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("LicenseNormaliser",)

_WHITESPACE_RE = re.compile(r"\s+")
_MAX_INPUT = 4096


class LicenseNormaliser:
    """Configurable license normalisation with plugin-based data sources.

    Plugins are passed as CLASSES (not instances). Each class is instantiated
    in ``__init__``, where its load_* method is called to fill the lookup tables.

    Six plugin types are supported (each returns specific data structure):
        - registry: key -> canonical_key
        - url: cleaned_url -> version_key
        - alias: alias_string -> version_key
        - family: version_key -> family_key
        - name: version_key -> name_key
        - prose: list of (compiled_pattern, version_key)

    Resolution order: aliases -> registry -> url -> prose -> unknown
    Name/family inference: plugin-provided overrides first, then the
    prefix-based fallbacks ``_infer_name`` and ``_infer_family``.

    Tracing
        Set ``trace=True`` to include resolution trace in the result. Trace shows
        which pipeline stage matched and the source file/line number (when
        available). Trace is disabled by default for performance.

        Trace can be enabled at three levels (precedence: method >
        constructor > env var):

        - **Constructor**: ``LicenseNormaliser(trace=True)`` - all calls get trace
        - **Method**: ``ln.normalise_license("MIT", trace=True)`` - this call only
        - **Environment**: ``ENABLE_LICENCE_NORMALISER_TRACE=1`` - applies globally

    Example::

        from licence_normaliser import LicenseNormaliser

        # Uses all defaults automatically
        ln = LicenseNormaliser()

        # Disable caching for debugging
        ln = LicenseNormaliser(cache=False)

        # Enable trace for all calls on this instance
        ln = LicenseNormaliser(trace=True)
        v = ln.normalise_license("MIT")
        print(v.explain())  # Shows resolution path with source lines

        # Or enable trace for a single call
        v = ln.normalise_license("MIT", trace=True)
    """

    def __init__(
        self,
        *,
        registry: Sequence[type] | None = None,
        url: Sequence[type] | None = None,
        alias: Sequence[type] | None = None,
        family: Sequence[type] | None = None,
        name: Sequence[type] | None = None,
        prose: Sequence[type] | None = None,
        cache: bool = True,
        cache_maxsize: int = 8192,
        trace: bool | None = None,
    ) -> None:
        self._registry: dict[str, str] = {}
        self._url_map: dict[str, str] = {}
        self._url_to_vkey: dict[str, str] = {}
        self._aliases: dict[str, str] = {}
        self._alias_lines: dict[str, tuple[str, int]] = {}
        self._publisher_alias_lines: dict[str, tuple[str, int]] = {}
        self._publisher_url_lines: dict[str, tuple[str, int]] = {}
        self._prose_lines: list[tuple[re.Pattern[str], str, int]] = []
        self._alias_lines_loaded: bool = False
        self._family_overrides: dict[str, str] = {}
        self._name_overrides: dict[str, str] = {}
        self._prose_patterns: list[tuple[re.Pattern[str], str]] = []
        self._cache = cache
        self._cache_maxsize = cache_maxsize
        self._trace_default = trace

        # Load plugins - use defaults if not explicitly provided
        registry = registry or get_default_registry()
        url = url or get_default_url()
        alias = alias or get_default_alias()
        family = family or get_default_family()
        name = name or get_default_name()
        prose = prose or get_default_prose()

        # Store plugin lists for trace resolution
        self._alias_plugins = alias
        self._url_plugins = url
        self._prose_plugins = prose

        # Instantiate plugins and load their data
        for plugin_cls in registry:
            data = plugin_cls().load_registry()
            self._registry.update(data)

        for plugin_cls in url:
            data = plugin_cls().load_urls()
            self._url_map.update(data)

        # Build inverted URL map: version_key -> cleaned_url (for LicenseVersion.url)
        self._url_to_vkey = {v: k for k, v in self._url_map.items()}

        for plugin_cls in alias:
            data = plugin_cls().load_aliases()
            self._aliases.update(data)

        for plugin_cls in family:
            data = plugin_cls().load_families()
            self._family_overrides.update(data)

        for plugin_cls in name:
            data = plugin_cls().load_names()
            self._name_overrides.update(data)

        for plugin_cls in prose:
            patterns = plugin_cls().load_prose()
            self._prose_patterns.extend(patterns)

        # Set up cached resolution
        if self._cache:
            resolve_fn = lru_cache(maxsize=self._cache_maxsize)(self._resolve_impl)
            self._resolve_impl = resolve_fn  # type: ignore[assignment]

    def _get_trace_mode(self, trace: bool | None) -> bool:
        """Resolve trace mode: method arg > constructor default > env var."""
        from licence_normaliser._trace import _should_trace

        if trace is not None:
            return trace
        if self._trace_default is not None:
            return self._trace_default
        return _should_trace()

    def _load_alias_lines(self) -> None:
        """Lazy load all source line numbers on first trace request."""
        # Alias plugins: record the source line of each alias that won the merge.
        for plugin_cls in self._alias_plugins:
            if hasattr(plugin_cls, "load_aliases_with_lines"):
                lines_data = plugin_cls().load_aliases_with_lines()
                for alias_key, (version_key, line_num) in lines_data.items():
                    if version_key == self._aliases.get(alias_key):
                        self._alias_lines[alias_key] = (version_key, line_num)

        for plugin_cls in self._url_plugins:
            if hasattr(plugin_cls, "load_aliases_with_lines"):
                lines_data = plugin_cls().load_aliases_with_lines()
                for alias_key, (version_key, line_num) in lines_data.items():
                    if version_key == self._aliases.get(alias_key):
                        self._publisher_alias_lines[alias_key] = (version_key, line_num)

        for plugin_cls in self._url_plugins:
            if hasattr(plugin_cls, "load_urls_with_lines"):
                lines_data = plugin_cls().load_urls_with_lines()
                for url_key, (version_key, line_num) in lines_data.items():
                    if version_key == self._url_map.get(url_key):
                        self._publisher_url_lines[url_key] = (version_key, line_num)

        for plugin_cls in self._prose_plugins:
            if hasattr(plugin_cls, "load_prose_with_lines"):
                lines_data = plugin_cls().load_prose_with_lines()
                self._prose_lines.extend(lines_data)

    def _resolve_with_trace(
        self, raw: str, cleaned: str, strict: bool
    ) -> LicenseVersion:
        """Resolve with full pipeline tracing."""
        from licence_normaliser._trace import LicenseTrace, LicenseTraceStage

        # Lazy load alias lines on first trace call
        if not self._alias_lines_loaded:
            self._load_alias_lines()
            self._alias_lines_loaded = True

        stages: list[LicenseTraceStage] = []

        # 1. Alias lookup
        if cleaned in self._aliases:
            output = self._aliases[cleaned]
            source_line = None
            source_file = None
            if cleaned in self._alias_lines:
                _, source_line = self._alias_lines[cleaned]
                source_file = "aliases.json"
            stages.append(
                LicenseTraceStage(
                    "alias", cleaned, output, True, source_line, source_file
                )
            )
            v = self._make(output)
            trace = LicenseTrace(
                raw,
                cleaned,
                stages,
                version_key=v.key,
                name_key=v.license.key,
                family_key=v.family.key,
            )
            return self._make_with_trace(v, trace)

        stages.append(LicenseTraceStage("alias", cleaned, "", False))

        # 2. Registry lookup
        if cleaned in self._registry:
            canonical = self._registry[cleaned]
            stages.append(LicenseTraceStage("registry", cleaned, canonical, True))
            v = self._make(canonical)
            trace = LicenseTrace(
                raw,
                cleaned,
                stages,
                version_key=v.key,
                name_key=v.license.key,
                family_key=v.family.key,
            )
            return self._make_with_trace(v, trace)

        stages.append(LicenseTraceStage("registry", cleaned, "", False))

        # 3. URL lookup
        url_key = self._normalise_url(cleaned)
        if url_key in self._url_map:
            resolved = self._url_map[url_key]
            source_line = None
            source_file = None
            if url_key in self._publisher_url_lines:
                _, source_line = self._publisher_url_lines[url_key]
                source_file = "publishers.json"
            stages.append(
                LicenseTraceStage(
                    "url", url_key, resolved, True, source_line, source_file
                )
            )
            v = self._make(resolved)
            trace = LicenseTrace(
                raw,
                cleaned,
                stages,
                version_key=v.key,
                name_key=v.license.key,
                family_key=v.family.key,
            )
            return self._make_with_trace(v, trace)

        stages.append(LicenseTraceStage("url", cleaned, "", False))

        # 4. Prose matching (only for longer strings)
        if len(cleaned) >= 20:
            for i, (pattern, vkey) in enumerate(self._prose_patterns):
                if pattern.search(cleaned):
                    source_line = None
                    source_file = "prose_patterns.json"
                    if self._prose_lines and i < len(self._prose_lines):
                        _, _, source_line = self._prose_lines[i]
                    stages.append(
                        LicenseTraceStage(
                            "prose", cleaned, vkey, True, source_line, source_file
                        )
                    )
                    v = self._make(vkey)
                    trace = LicenseTrace(
                        raw,
                        cleaned,
                        stages,
                        version_key=v.key,
                        name_key=v.license.key,
                        family_key=v.family.key,
                    )
                    return self._make_with_trace(v, trace)

        stages.append(LicenseTraceStage("prose", cleaned, "", False))

        # 5. Fallback to unknown
        stages.append(LicenseTraceStage("fallback", cleaned, cleaned, True))
        v = self._make_unknown(cleaned)
        trace = LicenseTrace(
            raw,
            cleaned,
            stages,
            version_key=v.key,
            name_key=v.license.key,
            family_key=v.family.key,
        )
        return self._make_with_trace(v, trace)

    def _make_with_trace(
        self, v: LicenseVersion, trace: LicenseTrace
    ) -> LicenseVersion:
        """Attach ``trace`` to ``v`` in place and return the same instance."""

        # LicenseVersion is a frozen dataclass, so bypass its __setattr__ guard.
        object.__setattr__(v, "_trace", trace)
        return v

    def _resolve_impl(self, cleaned: str) -> LicenseVersion:
        # 1. Alias lookup
        if cleaned in self._aliases:
            return self._make(self._aliases[cleaned])

        # 2. Registry lookup
        if cleaned in self._registry:
            canonical = self._registry[cleaned]
            return self._make(canonical)

        # 3. URL lookup
        url_key = self._normalise_url(cleaned)
        if url_key in self._url_map:
            return self._make(self._url_map[url_key])

        # 4. Prose matching (only for longer strings)
        if len(cleaned) >= 20:
            for pattern, vkey in self._prose_patterns:
                if pattern.search(cleaned):
                    return self._make(vkey)

        # 5. Fallback to unknown
        return self._make_unknown(cleaned)

    def normalise_license(
        self, raw: str, *, strict: bool = False, trace: bool | None = None
    ) -> LicenseVersion:
        """Normalise a single license string.

        Args:
            raw: The raw license string, SPDX ID, URL, or prose description.
            strict: If True, raises ``LicenseNotFoundError`` when the input
                cannot be resolved to a known license.
            trace: If True, include resolution trace showing which pipeline
                stage matched and source file/line. If None, uses the instance
                default (``trace`` param from constructor) or falls back to
                ``ENABLE_LICENCE_NORMALISER_TRACE`` env var.

        Returns:
            A ``LicenseVersion`` with the resolved key, license name, and family.

        Raises:
            LicenseNotFoundError: When ``strict=True`` and resolution fails.
        """
        from licence_normaliser.exceptions import LicenseNotFoundError

        do_trace = self._get_trace_mode(trace)

        if not raw or not raw.strip():
            cleaned = "unknown"
            v = self._make_unknown(cleaned)
            if do_trace:
                from licence_normaliser._trace import LicenseTrace, LicenseTraceStage

                stages = [LicenseTraceStage("fallback", cleaned, cleaned, True)]
                trace_obj = LicenseTrace(
                    raw,
                    cleaned,
                    stages,
                    version_key=v.key,
                    name_key=v.license.key,
                    family_key=v.family.key,
                )
                v = self._make_with_trace(v, trace_obj)
        else:
            cleaned = self._clean(self._try_decode_mojibake(raw))
            if do_trace:
                v = self._resolve_with_trace(raw, cleaned, strict)
            else:
                v = self._resolve_impl(cleaned)

        if strict and v.family.key == "unknown":
            raise LicenseNotFoundError(raw, v.key) from None
        return v

    def normalise_licenses(
        self, raws: Iterable[str], *, strict: bool = False, trace: bool | None = None
    ) -> list[LicenseVersion]:
        """Batch normalisation.

        When ``strict=True``, raises on the first failure.
        """
        from licence_normaliser.exceptions import LicenseNotFoundError

        results: list[LicenseVersion] = []
        for raw in raws:
            v = self.normalise_license(raw, strict=False, trace=trace)
            if strict and v.family.key == "unknown":
                raise LicenseNotFoundError(raw, v.key) from None
            results.append(v)
        return results

    def registry_keys(self) -> set[str]:
        """Return the set of all known registry keys."""
        return set(self._registry.keys())

    def _make(self, key: str) -> LicenseVersion:
        """Factory: build a LicenseVersion from a resolved version_key."""
        from licence_normaliser._models import (
            LicenseFamily,
            LicenseName,
            LicenseVersion,
        )

        k = key.lower().strip()

        # Get canonical key from registry
        canonical = self._registry.get(k) or k

        # Get URL via inverted map: version_key -> cleaned_url
        url = self._url_to_vkey.get(canonical) or self._url_to_vkey.get(k)

        # Infer name:
        # - For CC licenses, use the plugin override if present, else _infer_name
        # - For non-CC (GPL, AGPL, OSI, etc.), use the override only if it differs
        #   from canonical; otherwise keep canonical as-is (no version stripping)
        override_name = self._name_overrides.get(canonical)
        if canonical.startswith("cc-") or canonical.startswith("cc0"):
            # CC licenses: use override if present, otherwise fallback to _infer_name
            name_key = override_name if override_name else self._infer_name(canonical)
        else:
            # Non-CC: use override if present and different, otherwise canonical
            name_key = (
                override_name
                if override_name and override_name != canonical
                else canonical
            )

        # Infer family: use override only if it provides a different value
        override_family = self._family_overrides.get(canonical)
        family_key = (
            override_family
            if override_family and override_family != canonical
            else self._infer_family(canonical)
        )

        family = LicenseFamily(key=family_key)
        name = LicenseName(key=name_key, family=family)
        return LicenseVersion(key=canonical, url=url, license=name)

    def _make_unknown(self, key: str) -> LicenseVersion:
        """Factory: build an unknown LicenseVersion for unresolved input."""
        from licence_normaliser._models import (
            LicenseFamily,
            LicenseName,
            LicenseVersion,
        )

        family = LicenseFamily(key="unknown")
        name = LicenseName(key=key, family=family)
        return LicenseVersion(key=key, url=None, license=name)

    def _infer_family(self, key: str) -> str:
        """Fallback family inference - only used if no plugin provides it."""
        k = key.lower()
        if k.startswith("cc0"):
            return "cc0"
        if k.startswith("cc-pdm"):
            return "public-domain"
        if k.startswith("cc-"):
            return "cc"
        if k.startswith(("gpl-", "agpl-", "lgpl-")):
            return "copyleft"
        if k.startswith(("odbl", "odc-by")):
            return "open-data"
        if k.startswith(("pddl-", "odc-")):
            return "data"
        if k.startswith(
            (
                "elsevier-oa",
                "acs-authorchoice",
                "acs-authorchoice-ccby",
                "acs-authorchoice-ccbyncnd",
                "acs-authorchoice-nih",
                "jama-cc-by",
                "thieme-nlm",
                "implied-oa",
                "unspecified-oa",
                "publisher-specific-oa",
                "author-manuscript",
                "oup-chorus",
            )
        ):
            return "publisher-oa"
        if k.startswith(
            (
                "elsevier-tdm",
                "wiley-tdm",
                "springer-tdm",
                "springernature-tdm",
                "iop-tdm",
                "aps-tdm",
            )
        ):
            return "publisher-tdm"
        if k.startswith(
            (
                "elsevier-",
                "wiley-",
                "springer-",
                "springernature-",
                "acs-",
                "rsc-",
                "iop-",
                "bmj-",
                "aaas-",
                "pnas-",
                "aps-",
                "cup-",
                "aip-",
                "jama-",
                "degruyter-",
                "oup-",
                "sage-",
                "tandf-",
                "thieme-",
            )
        ):
            return "publisher-proprietary"
        if k in ("public-domain", "other-oa", "open-access"):
            return "public-domain" if k == "public-domain" else "other-oa"
        return "unknown"

    def _infer_name(self, key: str) -> str:
        """Fallback name inference - only used if no plugin provides it."""
        k = key.lower()
        if k.startswith("cc0"):
            return "cc0"
        if k.startswith("cc-"):
            parts = k.split("-")
            for i, part in enumerate(parts):
                if part.replace(".", "").isdigit():
                    return "-".join(parts[:i])
            return "-".join(parts[:2])
        # For all other licenses (GPL, AGPL, OSI, etc.), keep the key as-is
        return k

    @staticmethod
    def _clean(raw: str) -> str:
        s = _WHITESPACE_RE.sub(" ", raw.strip().rstrip("/")).lower()
        return s[:_MAX_INPUT]

    @staticmethod
    def _try_decode_mojibake(s: str) -> str:
        try:
            return s.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return s

    @staticmethod
    def _normalise_url(cleaned: str) -> str:
        key = cleaned.lower()
        if key.startswith("http://"):
            key = "https://" + key[7:]
        return key.rstrip("/")

src/licence_normaliser/_trace.py

"""License trace and explanation support."""

from __future__ import annotations

import os
from dataclasses import dataclass, field

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "TRACE_STAGES",
    "LicenseTrace",
    "LicenseTraceStage",
)

TRACE_STAGES = ("alias", "registry", "url", "prose", "fallback")


@dataclass
class LicenseTraceStage:
    """Single stage in the license resolution pipeline."""

    stage: str
    input: str
    output: str
    matched: bool
    source_line: int | None = None
    source_file: str | None = None


@dataclass
class LicenseTrace:
    """Complete trace of license resolution pipeline."""

    raw_input: str
    cleaned_input: str
    stages: list[LicenseTraceStage] = field(default_factory=list)
    version_key: str = ""
    name_key: str = ""
    family_key: str = ""

    def __str__(self) -> str:
        lines = [f"Input: {self.raw_input!r} → {self.cleaned_input!r}"]
        for s in self.stages:
            status = "✓" if s.matched else "-"
            source_info = ""
            if s.source_line is not None:
                source_info = f" (line {s.source_line}"
                if s.source_file:
                    source_info += f" in {s.source_file}"
                source_info += ")"
            lines.append(
                f"  [{status}] {s.stage}: {s.input!r} → {s.output!r}{source_info}"
            )
        lines.append("")
        lines.append("Result:")
        lines.append(f"  version_key: {self.version_key!r}")
        lines.append(f"  name_key: {self.name_key!r}")
        lines.append(f"  family_key: {self.family_key!r}")
        return "\n".join(lines)


def _should_trace() -> bool:
    """Check if tracing is enabled via environment variable."""
    return os.environ.get("ENABLE_LICENCE_NORMALISER_TRACE", "").lower() in (
        "1",
        "true",
        "yes",
    )

src/licence_normaliser/cli/__init__.py

"""licence_normaliser.cli - command-line interface for licence-normaliser."""

from ._main import main

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("main",)

src/licence_normaliser/cli/_main.py

"""licence-normaliser CLI - license normalisation from the command line."""

import argparse
import sys
from pathlib import Path

from licence_normaliser import __version__, normalise_license
from licence_normaliser._trace import _should_trace
from licence_normaliser.defaults import get_all_refreshable_plugins
from licence_normaliser.exceptions import (
    LicenseNormalisationError,
    LicenseNotFoundError,
)

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("main",)


def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="licence-normaliser",
        description="Comprehensive license normalisation - three-level hierarchy.",
    )
    parser.add_argument(
        "--version",
        action="version",
        version=f"%(prog)s {__version__}",
    )

    sub = parser.add_subparsers(dest="command", required=True)

    norm = sub.add_parser("normalise", help="Normalise a license string.")
    norm.add_argument("license", help="License string to normalise.")
    norm.add_argument("--full", action="store_true")
    norm.add_argument("--strict", action="store_true")
    norm.add_argument("--trace", action="store_true", help="Show resolution trace.")

    batch = sub.add_parser("batch", help="Normalise multiple license strings.")
    batch.add_argument("licenses", nargs="+")
    batch.add_argument("--strict", action="store_true")
    batch.add_argument(
        "--trace", action="store_true", help="Show resolution trace for each."
    )

    update = sub.add_parser(
        "update-data", help="Fetch fresh data from all registered parsers."
    )
    update.add_argument(
        "--parser",
        dest="parser_name",
        metavar="NAME",
        help="Refresh only the named parser (e.g. spdx, opendefinition, osi). "
        "Without this flag, all parsers are refreshed.",
    )
    update.add_argument(
        "--force",
        action="store_true",
        help="Overwrite even if the local file already exists.",
    )

    return parser


def _cmd_normalise(args: argparse.Namespace) -> int:
    try:
        trace = args.trace or _should_trace()
        result = normalise_license(args.license, strict=args.strict, trace=trace)
        if trace:
            print(result.explain())
        elif args.full:
            print(f"Key: {result.key}")
            print(f"URL: {result.url or '(none)'}")
            print(f"License: {result.license}")
            print(f"Family: {result.family}")
        else:
            print(result.key)
    except LicenseNotFoundError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    except LicenseNormalisationError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0


def _cmd_batch(args: argparse.Namespace) -> int:
    trace = args.trace or _should_trace()
    try:
        for license_str in args.licenses:
            # LicenseNotFoundError can only be raised when --strict is set.
            result = normalise_license(license_str, strict=args.strict, trace=trace)
            if trace:
                print(f"{license_str}:")
                print(result.explain())
            else:
                print(f"{license_str}: {result.key}")
    except LicenseNotFoundError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0


def _cmd_update_data(args: argparse.Namespace) -> int:
    parser_classes = get_all_refreshable_plugins()
    if args.parser_name:
        parser_classes = [
            p for p in parser_classes if getattr(p, "id", None) == args.parser_name
        ]
        if not parser_classes:
            available = [
                getattr(p, "id", p.__name__) for p in get_all_refreshable_plugins()
            ]
            print(
                f"error: unknown parser {args.parser_name!r}. Available: {available}",
                file=sys.stderr,
            )
            return 1

    failed: list[str] = []
    for parser_cls in parser_classes:
        name = getattr(parser_cls, "id", parser_cls.__name__)
        url = parser_cls.url
        target = parser_cls.local_path
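        target_path = Path(__file__).parent.parent / target
        # Check before refreshing: a successful fetch creates the file, so
        # testing afterwards would report every fetch as "skipped".
        existed = target_path.exists()
        ok = parser_cls.refresh(args.force)
        if not ok:
            status = "FAILED"
            failed.append(name)
        elif existed and not args.force:
            status = "skipped"
        else:
            status = "fetched"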
        print(f"  {status}: {name} ({url}) -> {target}")

    if failed:
        print(f"error: failed to refresh: {', '.join(failed)}", file=sys.stderr)
        return 1
    print("Data sources updated successfully.")
    return 0


def main() -> None:
    parser = _build_parser()
    args = parser.parse_args()

    if args.command == "normalise":
        sys.exit(_cmd_normalise(args))
    elif args.command == "batch":
        sys.exit(_cmd_batch(args))
    elif args.command == "update-data":
        sys.exit(_cmd_update_data(args))
    else:
        parser.print_help()
        sys.exit(1)

src/licence_normaliser/data/README.rst

Data Directory
==============

This directory contains all normalisation data files loaded at runtime
by ``licence-normaliser``. You can extend or override entries without
touching any Python code.

Structure
---------

::

    data/
    ├── aliases/
    │   └── aliases.json             # Alias string → metadata dict
    ├── urls/
    │   └── url_map.json             # Canonical URL → metadata dict
    ├── prose/
    │   └── prose_patterns.json      # Ordered regex patterns for long text scanning
    ├── publishers/
    │   └── publishers.json          # Publisher URLs and shorthand aliases
    ├── spdx/
    │   └── spdx.json                # SPDX license list (auto-refreshed)
    ├── opendefinition/
    │   └── opendefinition.json      # Open Definition list (auto-refreshed)
    ├── osi/
    │   └── osi.json                 # OSI license list (auto-refreshed)
    ├── creativecommons/
    │   └── creativecommons.json     # CC licenses (scraped from creativecommons.org)
    └── scancode_licensedb/
        └── scancode_licensedb.json  # ScanCode license DB (auto-refreshed)

Entry Format
------------

Every entry maps a **lookup key** (alias string, URL, or prose pattern)
to a metadata dict with three required fields:

- ``version_key`` – the canonical version-level identifier
  (e.g. ``"cc-by-4.0"``)
- ``name_key`` – the name-level identifier without version suffix
  (e.g. ``"cc-by"``)
- ``family_key`` – the family-level identifier (e.g. ``"cc"``)

URLs are stored separately in the ``url`` field of the metadata dict.
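
As a rough illustration of the entry format, the sketch below checks that a
metadata dict carries the three required fields. The helper name
``missing_fields`` is hypothetical and is not part of the package.

```python
REQUIRED_FIELDS = ("version_key", "name_key", "family_key")


def missing_fields(entry: dict) -> list[str]:
    """Return the required fields absent from a metadata dict (hypothetical helper)."""
    return [field for field in REQUIRED_FIELDS if field not in entry]


entry = {"version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"}
print(missing_fields(entry))  # []
print(missing_fields({"version_key": "mit"}))  # ['name_key', 'family_key']
```

A check like this could be run over a data file before committing a change.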

How to Add a New License Alias
------------------------------

Edit ``aliases/aliases.json``:

.. code:: json

   {
     "my new alias": {
       "version_key": "cc-by-4.0",
       "name_key": "cc-by",
       "family_key": "cc"
     }
   }

The key must be **lowercase and whitespace-collapsed**.
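
A minimal sketch of what "lowercase and whitespace-collapsed" means in
practice is shown below. The function ``clean_key`` is a hypothetical helper
written for illustration; the package's own cleaning routine may apply
further rules.

```python
def clean_key(raw: str) -> str:
    """Lowercase the string and collapse whitespace runs to single spaces.

    Assumed cleaning rule for illustration only.
    """
    return " ".join(raw.lower().split())


print(clean_key("  CC   BY\t4.0 "))  # cc by 4.0
print(clean_key("Creative  Commons Attribution"))  # creative commons attribution
```

A new alias key should already be in this form when it is written into
``aliases/aliases.json``.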

How to Add a Publisher URL or Shorthand
---------------------------------------

Edit ``publishers/publishers.json``:

.. code:: json

   {
     "urls": {
       "https://example.com/my-license/": {
         "version_key": "my-license",
         "name_key": "my-license",
         "family_key": "publisher-oa"
       }
     },
     "shorthand_aliases": {
       "my shorthand alias": "my-license"
     }
   }

Both ``http://`` and ``https://`` URL variants may be listed; they are
normalised at lookup time (http→https, trailing slash stripped).
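
The two normalisation steps named above can be sketched as follows. The
function ``normalise_url`` is a hypothetical stand-in, not the package's
actual implementation, which may handle more cases (for example letter case
or query strings).

```python
def normalise_url(url: str) -> str:
    """Upgrade http to https and strip trailing slashes (illustrative sketch)."""
    url = url.strip()
    if url.startswith("http://"):
        url = "https://" + url[len("http://"):]
    return url.rstrip("/")


print(normalise_url("http://www.springer.com/tdm/"))  # https://www.springer.com/tdm
print(normalise_url("https://www.springer.com/tdm"))  # https://www.springer.com/tdm
```

Under this assumption, both variants resolve to the same lookup key.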

How to Add a New URL Mapping
----------------------------

Edit ``urls/url_map.json``:

.. code:: json

   {
     "https://example.com/my-license/": {
       "version_key": "my-license",
       "name_key": "my-license",
       "family_key": "publisher-oa"
     }
   }

How to Add a New Prose Pattern
------------------------------

Edit ``prose/prose_patterns.json`` — insert your entry **before** any
pattern it should take priority over:

.. code:: json

   [
     {"pattern": "my very specific phrase",
      "version_key": "my-license",
      "name_key": "my-license",
      "family_key": "publisher-oa"},
     ...
   ]

Patterns are Python regular expressions matched case-insensitively.
More-specific patterns must come first.
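
To see why ordering matters, here is a small sketch of first-match-wins
scanning. It assumes the scanner tries patterns in file order with
``re.search`` and stops at the first hit; the function ``first_match`` and
the shortened ``PATTERNS`` list are illustrative, not the package's code.

```python
import re
from typing import Optional

# A shortened, ordered pattern list in the same shape as prose_patterns.json.
PATTERNS = [
    {"pattern": r"cc\s*by-nc-nd\s*4\.0", "version_key": "cc-by-nc-nd-4.0"},
    {"pattern": r"cc\s*by-nc-nd", "version_key": "cc-by-nc-nd"},
    {"pattern": r"cc\s*by-nc", "version_key": "cc-by-nc"},
]


def first_match(text: str, patterns: list[dict]) -> Optional[str]:
    """Return the version_key of the first pattern found in text, else None."""
    for entry in patterns:
        if re.search(entry["pattern"], text, flags=re.IGNORECASE):
            return entry["version_key"]
    return None


text = "Released under CC BY-NC-ND 4.0."
print(first_match(text, PATTERNS))  # cc-by-nc-nd-4.0
# With the order reversed, the broad pattern shadows the specific ones.
print(first_match(text, list(reversed(PATTERNS))))  # cc-by-nc
```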

How to Add a Brand-New License
------------------------------

1. Add entries to one or more JSON data files (``aliases/aliases.json``,
   ``urls/url_map.json``, ``prose/prose_patterns.json``, or
   ``publishers/publishers.json``). Each entry maps a key to a dict with
   ``version_key``, ``name_key``, and ``family_key``.

2. If the regex fallback table in ``_registry.py`` cannot infer the family
   for the new license, set ``family_key`` explicitly in the JSON entry
   (recommended).

3. Run ``make test-env ENV=py312`` to verify.

Updating Upstream Data Sources
------------------------------

The ``licence-normaliser update-data`` CLI command fetches fresh upstream
data. Pass ``--force`` to re-fetch sources whose local file already exists:

.. code:: sh

    licence-normaliser update-data --force

This updates:

- ``spdx/spdx.json`` — full `SPDX license list <https://spdx.org/licenses/>`_
- ``opendefinition/opendefinition.json`` — full `Open Definition list <https://opendefinition.org/>`_
- ``osi/osi.json`` — `OSI license list <https://opensource.org/licenses>`_
- ``creativecommons/creativecommons.json`` — scraped from creativecommons.org
- ``scancode_licensedb/scancode_licensedb.json`` — `ScanCode license DB <https://scancode-licensedb.aboutcode.org/>`_

Family Override Files
---------------------

Some entries carry an explicit ``family_key`` that overrides the
inference logic in ``_registry.py``.  These are stored in:

- ``aliases/aliases.json`` — ``family_key`` on any alias entry populates
  ``FAMILY_OVERRIDES`` at import time.
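
One way such a table could be derived from the alias data is sketched below.
This is an assumption-laden illustration: the function
``build_family_overrides`` is hypothetical, and keying the table by
``version_key`` is a guess about the shape of ``FAMILY_OVERRIDES``. Non-dict
values such as the ``_comment`` entries are skipped.

```python
import json

# Inline sample in the same shape as aliases/aliases.json.
ALIASES_JSON = """
{
  "_comment": "Curated alias map.",
  "odbl": {"version_key": "odbl", "name_key": "odbl", "family_key": "open-data"},
  "pd": {"version_key": "public-domain", "name_key": "public-domain",
         "family_key": "public-domain"}
}
"""


def build_family_overrides(aliases: dict) -> dict[str, str]:
    """Map version_key to family_key for every entry that sets a family (assumed shape)."""
    overrides: dict[str, str] = {}
    for entry in aliases.values():
        if not isinstance(entry, dict):
            continue  # skip "_comment" strings and similar metadata
        family = entry.get("family_key")
        if family:
            overrides[entry["version_key"]] = family
    return overrides


FAMILY_OVERRIDES = build_family_overrides(json.loads(ALIASES_JSON))
print(FAMILY_OVERRIDES)  # {'odbl': 'open-data', 'public-domain': 'public-domain'}
```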

src/licence_normaliser/data/aliases/aliases.json

{
  "_comment": "Curated alias map: cleaned-lowercase-string -> metadata dict.",
  "_comment2": "Keys must already be in cleaned form (lowercase, whitespace-collapsed).",
  "aaas reuse": {
    "version_key": "aaas-author-reuse",
    "name_key": "aaas-author-reuse",
    "family_key": "publisher-proprietary",
    "aliases": [
      "aaas author reuse",
      "aaas-author-reuse",
      "science author reuse"
    ]
  },
  "acs authorchoice": {
    "version_key": "acs-authorchoice",
    "name_key": "acs-authorchoice",
    "family_key": "publisher-oa",
    "aliases": [
      "acs-authorchoice"
    ]
  },
  "acs-authorchoice-ccby": {
    "version_key": "acs-authorchoice-ccby",
    "name_key": "acs-authorchoice-ccby",
    "family_key": "publisher-oa",
    "aliases": [
      "acs authorchoice cc by"
    ]
  },
  "acs-authorchoice-ccbyncnd": {
    "version_key": "acs-authorchoice-ccbyncnd",
    "name_key": "acs-authorchoice-ccbyncnd",
    "family_key": "publisher-oa"
  },
  "acs-authorchoice-nih": {
    "version_key": "acs-authorchoice-nih",
    "name_key": "acs-authorchoice-nih",
    "family_key": "publisher-oa"
  },
  "agpl-3": {
    "version_key": "agpl-3.0",
    "name_key": "agpl-3",
    "family_key": "copyleft",
    "aliases": [
      "agpl-v3",
      "agpl 3",
      "agpl",
      "agpl v3",
      "agpl-3.0+"
    ]
  },
  "aip-rights": {
    "version_key": "aip-rights",
    "name_key": "aip-rights",
    "family_key": "publisher-proprietary",
    "aliases": [
      "aip permissions"
    ]
  },
  "all rights reserved": {
    "version_key": "all-rights-reserved",
    "name_key": "all-rights-reserved",
    "family_key": "publisher-proprietary",
    "aliases": [
      "all-rights-reserved"
    ]
  },
  "apache 2.0": {
    "version_key": "apache-2.0",
    "name_key": "apache",
    "family_key": "osi",
    "aliases": [
      "apache 2",
      "apache",
      "apache license",
      "apache license 2.0"
    ]
  },
  "aps-default": {
    "version_key": "aps-default",
    "name_key": "aps-default",
    "family_key": "publisher-proprietary",
    "aliases": [
      "aps default license"
    ]
  },
  "aps-tdm": {
    "version_key": "aps-tdm",
    "name_key": "aps-tdm",
    "family_key": "publisher-tdm",
    "aliases": [
      "aps text mining"
    ]
  },
  "author manuscript": {
    "version_key": "author-manuscript",
    "name_key": "author-manuscript",
    "family_key": "publisher-oa",
    "aliases": [
      "author-manuscript"
    ]
  },
  "bmj-copyright": {
    "version_key": "bmj-copyright",
    "name_key": "bmj-copyright",
    "family_key": "publisher-proprietary"
  },
  "bsd 2-clause": {
    "version_key": "bsd-2-clause",
    "name_key": "bsd-2-clause",
    "family_key": "osi",
    "aliases": [
      "bsd 2 clause",
      "bsd-2-clause",
      "bsd-2"
    ]
  },
  "bsd 3-clause": {
    "version_key": "bsd-3-clause",
    "name_key": "bsd-3-clause",
    "family_key": "osi",
    "aliases": [
      "bsd 3 clause",
      "bsd-3-clause",
      "bsd-3",
      "bsd-3 license",
      "bsd",
      "bsd license"
    ],
    "justification": "BSD 3-Clause is commonly referred to as plain 'BSD', so the bare 'bsd' and 'bsd license' aliases resolve here rather than to BSD-2-Clause."
  },
  "cc by": {
    "version_key": "cc-by",
    "name_key": "cc-by",
    "family_key": "cc",
    "aliases": [
      "cc-by",
      "cc by",
      "creative commons attribution",
      "creative commons attribution license",
      "creative commons by"
    ]
  },
  "cc by 1.0": {
    "version_key": "cc-by-1.0",
    "name_key": "cc-by",
    "family_key": "cc"
  },
  "cc by 2.0": {
    "version_key": "cc-by-2.0",
    "name_key": "cc-by",
    "family_key": "cc"
  },
  "cc by 2.5": {
    "version_key": "cc-by-2.5",
    "name_key": "cc-by",
    "family_key": "cc"
  },
  "cc by 3.0": {
    "version_key": "cc-by-3.0",
    "name_key": "cc-by",
    "family_key": "cc",
    "aliases": [
      "cc-by-3.0",
      "cc-by-3",
      "creative commons attribution 3.0"
    ]
  },
  "cc by 4.0": {
    "version_key": "cc-by-4.0",
    "name_key": "cc-by",
    "family_key": "cc",
    "aliases": [
      "cc-by-4.0",
      "cc by 4",
      "cc-by 4",
      "cc-by-4",
      "creative commons attribution 4.0",
      "creative commons attribution 4.0 international",
      "creative commons attribution 4.0 international license",
      "creative commons by 4.0"
    ]
  },
  "cc by-nc": {
    "version_key": "cc-by-nc",
    "name_key": "cc-by-nc",
    "family_key": "cc",
    "aliases": [
      "cc-by-nc",
      "cc by nc",
      "cc-by nc",
      "creative commons attribution-noncommercial",
      "creative commons by-nc"
    ]
  },
  "cc by-nc 3.0": {
    "version_key": "cc-by-nc-3.0",
    "name_key": "cc-by-nc",
    "family_key": "cc"
  },
  "cc by-nc 4.0": {
    "version_key": "cc-by-nc-4.0",
    "name_key": "cc-by-nc",
    "family_key": "cc",
    "aliases": [
      "cc-by-nc-4.0",
      "cc by nc 4",
      "cc-by nc 4",
      "cc by nc-4",
      "cc-by nc-4",
      "cc-by-nc 4",
      "creative commons attribution-noncommercial 4.0",
      "creative commons attribution-noncommercial 4.0 international",
      "creative commons attribution-noncommercial 4.0 international license",
      "creative commons by-nc 4.0"
    ]
  },
  "cc by-nc-nd": {
    "version_key": "cc-by-nc-nd",
    "name_key": "cc-by-nc-nd",
    "family_key": "cc",
    "aliases": [
      "cc-by-nc-nd",
      "cc by nc-nd",
      "cc by nc nd",
      "cc-by nc-nd",
      "creative commons attribution-noncommercial-noderivatives",
      "creative commons by-nc-nd"
    ]
  },
  "cc by-nc-nd 3.0": {
    "version_key": "cc-by-nc-nd-3.0",
    "name_key": "cc-by-nc-nd",
    "family_key": "cc"
  },
  "cc by-nc-nd 3.0 igo": {
    "version_key": "cc-by-nc-nd-3.0-igo",
    "name_key": "cc-by-nc-nd",
    "family_key": "cc",
    "justification": "IGO is a jurisdiction tag not a rights modifier. Rights profile (Attribution + NonCommercial + NoDerivatives) is identical to base instrument. Enforcement differs (international arbitration vs domestic courts) but does not affect license type."
  },
  "cc by-nc-nd 4.0": {
    "version_key": "cc-by-nc-nd-4.0",
    "name_key": "cc-by-nc-nd",
    "family_key": "cc",
    "aliases": [
      "cc-by-nc-nd-4.0",
      "cc by nc-nd 4",
      "cc-by nc-nd 4",
      "cc by nc-nd-4",
      "cc-by nc-nd-4",
      "cc-by-nc-nd 4",
      "creative commons attribution-noncommercial-noderivatives 4.0",
      "creative commons attribution-noncommercial-noderivatives 4.0 international",
      "creative commons attribution-noncommercial-noderivatives 4.0 international license",
      "creative commons by-nc-nd 4.0"
    ]
  },
  "cc by-nc-sa": {
    "version_key": "cc-by-nc-sa",
    "name_key": "cc-by-nc-sa",
    "family_key": "cc",
    "aliases": [
      "cc-by-nc-sa",
      "cc by nc-sa",
      "cc by nc sa",
      "cc-by nc-sa",
      "creative commons by-nc-sa"
    ]
  },
  "cc by-nc-sa 3.0": {
    "version_key": "cc-by-nc-sa-3.0",
    "name_key": "cc-by-nc-sa",
    "family_key": "cc"
  },
  "cc by-nc-sa 4.0": {
    "version_key": "cc-by-nc-sa-4.0",
    "name_key": "cc-by-nc-sa",
    "family_key": "cc",
    "aliases": [
      "cc-by-nc-sa-4.0",
      "cc by nc-sa 4",
      "cc-by nc-sa 4",
      "cc-by-nc-sa 4",
      "cc by nc-sa-4",
      "cc-by nc-sa-4",
      "creative commons attribution-noncommercial-sharealike 4.0",
      "creative commons attribution-noncommercial-sharealike 4.0 international",
      "creative commons attribution-noncommercial-sharealike 4.0 international license",
      "creative commons by-nc-sa 4.0"
    ]
  },
  "cc by-nd": {
    "version_key": "cc-by-nd",
    "name_key": "cc-by-nd",
    "family_key": "cc",
    "aliases": [
      "cc-by-nd",
      "cc by nd",
      "cc-by nd",
      "creative commons by-nd",
      "creative commons attribution-noderivatives"
    ]
  },
  "cc by-nd 3.0": {
    "version_key": "cc-by-nd-3.0",
    "name_key": "cc-by-nd",
    "family_key": "cc"
  },
  "cc by-nd 4.0": {
    "version_key": "cc-by-nd-4.0",
    "name_key": "cc-by-nd",
    "family_key": "cc",
    "aliases": [
      "cc-by-nd-4.0",
      "cc by nd 4",
      "cc-by nd 4",
      "cc by nd-4",
      "cc-by nd-4",
      "cc-by-nd 4",
      "creative commons attribution-noderivatives 4.0",
      "creative commons attribution-noderivatives 4.0 international",
      "creative commons attribution-noderivatives 4.0 international license",
      "creative commons by-nd 4.0"
    ]
  },
  "cc by-sa": {
    "version_key": "cc-by-sa",
    "name_key": "cc-by-sa",
    "family_key": "cc",
    "aliases": [
      "cc-by-sa",
      "cc by sa",
      "cc-by sa",
      "creative commons attribution-sharealike",
      "creative commons by-sa"
    ]
  },
  "cc by-sa 3.0": {
    "version_key": "cc-by-sa-3.0",
    "name_key": "cc-by-sa",
    "family_key": "cc"
  },
  "cc by-sa 4.0": {
    "version_key": "cc-by-sa-4.0",
    "name_key": "cc-by-sa",
    "family_key": "cc",
    "aliases": [
      "cc-by-sa-4.0",
      "cc by sa 4",
      "cc-by sa 4",
      "cc by sa-4",
      "cc-by sa-4",
      "cc-by-sa 4",
      "creative commons attribution-sharealike 4.0",
      "creative commons attribution-sharealike 4.0 international",
      "creative commons attribution-sharealike 4.0 international license",
      "creative commons by-sa 4.0"
    ]
  },
  "cc-pdm 1.0": {
    "version_key": "cc-pdm-1.0",
    "name_key": "cc-pdm",
    "family_key": "public-domain",
    "aliases": [
      "cc-pdm-1.0",
      "cc pdm 1.0",
      "cc pdm-1.0",
      "cc-pdm",
      "cc pdm",
      "creative commons public domain",
      "creative commons public domain mark 1.0",
      "creative commons public domain mark"
    ]
  },
  "cc0 1.0": {
    "version_key": "cc0-1.0",
    "name_key": "cc0",
    "family_key": "cc0",
    "aliases": [
      "cc0-1.0",
      "cc-zero 1.0",
      "cc zero 1.0",
      "creative commons zero 1.0",
      "cc0",
      "cc 0",
      "cc zero",
      "creative commons zero",
      "cc-zero"
    ]
  },
  "cup-terms": {
    "version_key": "cup-terms",
    "name_key": "cup-terms",
    "family_key": "publisher-proprietary",
    "aliases": [
      "cambridge terms"
    ]
  },
  "degruyter-terms": {
    "version_key": "degruyter-terms",
    "name_key": "degruyter-terms",
    "family_key": "publisher-proprietary",
    "aliases": [
      "de gruyter terms"
    ]
  },
  "elsevier oa": {
    "version_key": "elsevier-oa",
    "name_key": "elsevier-oa",
    "family_key": "publisher-oa",
    "aliases": [
      "elsevier-oa",
      "elsevier user license"
    ]
  },
  "elsevier tdm": {
    "version_key": "elsevier-tdm",
    "name_key": "elsevier-tdm",
    "family_key": "publisher-tdm",
    "aliases": [
      "elsevier tdmu",
      "elsevier-tdm"
    ]
  },
  "gpl-2": {
    "version_key": "gpl-2.0",
    "name_key": "gpl-2",
    "family_key": "copyleft",
    "aliases": [
      "gpl-v2",
      "gpl 2",
      "gnu gpl v2",
      "gpl v2",
      "gpl-2.0+"
    ]
  },
  "gpl-3": {
    "version_key": "gpl-3.0",
    "name_key": "gpl-3",
    "family_key": "copyleft",
    "aliases": [
      "gpl-v3",
      "gpl v3 only",
      "gpl 3",
      "gnu gpl",
      "gnu gpl v3",
      "gpl",
      "gpl v3",
      "gpl-3.0+"
    ],
    "justification": "gnu gpl, gnu gpl v3, gpl, gpl v3, gpl-3, and gpl-3.0+ are all standard aliases for GPL-3.0."
  },
  "implied oa": {
    "version_key": "implied-oa",
    "name_key": "implied-oa",
    "family_key": "publisher-oa",
    "aliases": [
      "implied open access",
      "implied-oa"
    ]
  },
  "iop-copyright": {
    "version_key": "iop-copyright",
    "name_key": "iop-copyright",
    "family_key": "publisher-proprietary"
  },
  "iop-tdm": {
    "version_key": "iop-tdm",
    "name_key": "iop-tdm",
    "family_key": "publisher-tdm",
    "aliases": [
      "iop text and data mining"
    ]
  },
  "isc license": {
    "version_key": "isc",
    "name_key": "isc",
    "family_key": "osi"
  },
  "jama-cc-by": {
    "version_key": "jama-cc-by",
    "name_key": "jama-cc-by",
    "family_key": "publisher-oa",
    "aliases": [
      "jama open access"
    ]
  },
  "lgpl": {
    "version_key": "lgpl-3.0",
    "name_key": "lgpl-3",
    "family_key": "copyleft"
  },
  "lgpl v2.1": {
    "version_key": "lgpl-2.1",
    "name_key": "lgpl-2.1",
    "family_key": "copyleft"
  },
  "lgpl v3": {
    "version_key": "lgpl-3.0",
    "name_key": "lgpl-3",
    "family_key": "copyleft"
  },
  "lgpl-2": {
    "version_key": "lgpl-2.1",
    "name_key": "lgpl-2.1",
    "family_key": "copyleft",
    "aliases": [
      "lgpl-v2",
      "lgpl 2",
      "lgpl-2.1-only",
      "lgpl-2.1-or-later"
    ]
  },
  "lgpl-2.1+": {
    "version_key": "lgpl-2.1",
    "name_key": "lgpl-2.1",
    "family_key": "copyleft"
  },
  "lgpl-3": {
    "version_key": "lgpl-3.0",
    "name_key": "lgpl-3",
    "family_key": "copyleft",
    "aliases": [
      "lgpl-v3",
      "lgpl 3"
    ]
  },
  "lgpl-3.0+": {
    "version_key": "lgpl-3.0",
    "name_key": "lgpl-3",
    "family_key": "copyleft"
  },
  "mit license": {
    "version_key": "mit",
    "name_key": "mit",
    "family_key": "osi",
    "aliases": [
      "the mit license"
    ]
  },
  "mozilla public license 2.0": {
    "version_key": "mpl-2.0",
    "name_key": "mpl",
    "family_key": "osi",
    "aliases": [
      "mpl",
      "mpl-2.0",
      "mpl 2.0",
      "mozilla license",
      "mozilla public license",
      "mozilla"
    ]
  },
  "no reuse": {
    "version_key": "no-reuse",
    "name_key": "no-reuse",
    "family_key": "publisher-proprietary",
    "aliases": [
      "no-reuse"
    ]
  },
  "odbl": {
    "version_key": "odbl",
    "name_key": "odbl",
    "family_key": "open-data",
    "aliases": [
      "open database license"
    ]
  },
  "odc-by": {
    "version_key": "odc-by",
    "name_key": "odc-by",
    "family_key": "open-data"
  },
  "other-oa": {
    "version_key": "other-oa",
    "name_key": "other-oa",
    "family_key": "other-oa",
    "aliases": [
      "open access",
      "open-access"
    ]
  },
  "oup-chorus": {
    "version_key": "oup-chorus",
    "name_key": "oup-chorus",
    "family_key": "publisher-oa"
  },
  "oup-terms": {
    "version_key": "oup-terms",
    "name_key": "oup-terms",
    "family_key": "publisher-proprietary",
    "aliases": [
      "oup standard publication"
    ]
  },
  "pd": {
    "version_key": "public-domain",
    "name_key": "public-domain",
    "family_key": "public-domain",
    "aliases": [
      "public domain",
      "public-domain"
    ]
  },
  "pddl": {
    "version_key": "pddl",
    "name_key": "pddl",
    "family_key": "open-data"
  },
  "pnas terms": {
    "version_key": "pnas-licenses",
    "name_key": "pnas-licenses",
    "family_key": "publisher-proprietary",
    "aliases": [
      "pnas-licenses"
    ]
  },
  "rsc-terms": {
    "version_key": "rsc-terms",
    "name_key": "rsc-terms",
    "family_key": "publisher-proprietary"
  },
  "sage-permissions": {
    "version_key": "sage-permissions",
    "name_key": "sage-permissions",
    "family_key": "publisher-proprietary"
  },
  "springer tdm": {
    "version_key": "springer-tdm",
    "name_key": "springer-tdm",
    "family_key": "publisher-tdm",
    "aliases": [
      "springer-tdm"
    ]
  },
  "springernature-tdm": {
    "version_key": "springernature-tdm",
    "name_key": "springernature-tdm",
    "family_key": "publisher-tdm",
    "aliases": [
      "springer nature tdm",
      "springer nature text and data mining"
    ]
  },
  "tandf-terms": {
    "version_key": "tandf-terms",
    "name_key": "tandf-terms",
    "family_key": "publisher-proprietary",
    "aliases": [
      "taylor and francis terms",
      "taylor francis terms"
    ]
  },
  "thieme nlm": {
    "version_key": "thieme-nlm",
    "name_key": "thieme-nlm",
    "family_key": "publisher-oa",
    "aliases": [
      "thieme-nlm"
    ]
  },
  "unlicense": {
    "version_key": "unlicense",
    "name_key": "unlicense",
    "family_key": "osi"
  },
  "unspecified oa": {
    "version_key": "unspecified-oa",
    "name_key": "unspecified-oa",
    "family_key": "other-oa",
    "aliases": [
      "unspecified-oa"
    ]
  },
  "wiley terms": {
    "version_key": "wiley-terms",
    "name_key": "wiley-terms",
    "family_key": "publisher-proprietary",
    "aliases": [
      "wiley-terms"
    ]
  },
  "wiley-am": {
    "version_key": "wiley-am",
    "name_key": "wiley-am",
    "family_key": "publisher-proprietary",
    "aliases": [
      "wiley author manuscript"
    ]
  },
  "wiley-tdm": {
    "version_key": "wiley-tdm",
    "name_key": "wiley-tdm",
    "family_key": "publisher-tdm",
    "aliases": [
      "wiley tdm license"
    ]
  },
  "wiley-vor": {
    "version_key": "wiley-vor",
    "name_key": "wiley-vor",
    "family_key": "publisher-proprietary"
  },
  "wtfpl": {
    "version_key": "wtfpl",
    "name_key": "wtfpl",
    "family_key": "osi"
  },
  "zlib": {
    "version_key": "zlib",
    "name_key": "zlib",
    "family_key": "osi"
  },
  "© the author(s)": {
    "version_key": "publisher-specific-oa",
    "name_key": "publisher-specific-oa",
    "family_key": "publisher-oa",
    "aliases": [
      "publisher specific oa",
      "publisher-specific-oa"
    ]
  }
}

src/licence_normaliser/data/prose/prose_patterns.json

[
  {"pattern": "cc\\s*by-nc-nd\\s*4\\.0", "version_key": "cc-by-nc-nd-4.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
  {"pattern": "cc\\s*by-nc-nd\\s*3\\.0", "version_key": "cc-by-nc-nd-3.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
  {"pattern": "cc\\s*by-nc-nd", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"},
  {"pattern": "cc\\s*by-nc-sa\\s*4\\.0", "version_key": "cc-by-nc-sa-4.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
  {"pattern": "cc\\s*by-nc-sa\\s*3\\.0", "version_key": "cc-by-nc-sa-3.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
  {"pattern": "creative\\s+commons\\s+by-nc-sa", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"},
  {"pattern": "creative\\s+commons\\s+by-nc-nd", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"},
  {"pattern": "creative\\s+commons\\s+by-nc", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"},
  {"pattern": "creative\\s+commons\\s+by-nd", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"},
  {"pattern": "creative\\s+commons\\s+by-sa", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"},
  {"pattern": "creative\\s+commons\\s+by", "version_key": "cc-by", "name_key": "cc-by", "family_key": "cc"},
  {"pattern": "cc\\s*by-nc-sa", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"},
  {"pattern": "cc\\s*by-nc\\s*4\\.0", "version_key": "cc-by-nc-4.0", "name_key": "cc-by-nc", "family_key": "cc"},
  {"pattern": "cc\\s*by-nc\\s*3\\.0", "version_key": "cc-by-nc-3.0", "name_key": "cc-by-nc", "family_key": "cc"},
  {"pattern": "cc\\s*by-nc", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"},
  {"pattern": "cc\\s*by-sa\\s*4\\.0", "version_key": "cc-by-sa-4.0", "name_key": "cc-by-sa", "family_key": "cc"},
  {"pattern": "cc\\s*by-sa\\s*3\\.0", "version_key": "cc-by-sa-3.0", "name_key": "cc-by-sa", "family_key": "cc"},
  {"pattern": "cc\\s*by-sa", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"},
  {"pattern": "cc\\s*by-nd\\s*4\\.0", "version_key": "cc-by-nd-4.0", "name_key": "cc-by-nd", "family_key": "cc"},
  {"pattern": "cc\\s*by-nd\\s*3\\.0", "version_key": "cc-by-nd-3.0", "name_key": "cc-by-nd", "family_key": "cc"},
  {"pattern": "cc\\s*by-nd", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"},
  {"pattern": "cc\\s*by\\s*4\\.0", "version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"},
  {"pattern": "cc\\s*by\\s*3\\.0", "version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"},
  {"pattern": "cc\\s*by\\s*2\\.0", "version_key": "cc-by-2.0", "name_key": "cc-by", "family_key": "cc"},
  {"pattern": "\\bcc\\s*by\\b(?!\\s*-)", "version_key": "cc-by", "name_key": "cc-by", "family_key": "cc"},
  {"pattern": "\\bcc\\s*0\\b|cc\\s*zero", "version_key": "cc0", "name_key": "cc0", "family_key": "cc0"},
  {"pattern": "attribution.{0,30}non.?commercial.{0,30}no.?deriv", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"},
  {"pattern": "attribution.{0,30}non.?commercial.{0,30}share.?alike", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"},
  {"pattern": "attribution.{0,30}non.?commercial", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"},
  {"pattern": "attribution.{0,30}no.?deriv", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"},
  {"pattern": "attribution.{0,30}share.?alike", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"},

  {"pattern": "elsevier.*tdm|tdm.*elsevier", "version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"},
  {"pattern": "elsevier.*user\\s*licen", "version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"},
  {"pattern": "wiley.*tdm|tdm.*wiley", "version_key": "wiley-tdm", "name_key": "wiley-tdm", "family_key": "publisher-tdm"},
  {"pattern": "springer.*tdm|tdm.*springer", "version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"},
  {"pattern": "acs\\s*authorchoice.*cc\\s*by(?!-nc)", "version_key": "acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa"},
  {"pattern": "acs\\s*authorchoice", "version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"},
  {"pattern": "all\\s*rights\\s*reserved", "version_key": "all-rights-reserved", "name_key": "all-rights-reserved", "family_key": "publisher-proprietary"},
  {"pattern": "author\\s*manuscript", "version_key": "author-manuscript", "name_key": "author-manuscript", "family_key": "publisher-oa"},
  {"pattern": "public\\s*domain", "version_key": "public-domain", "name_key": "public-domain", "family_key": "public-domain"},
  {"pattern": "open\\s*access", "version_key": "other-oa", "name_key": "other-oa", "family_key": "other-oa"}
]

src/licence_normaliser/data/publishers/publishers.json

{
  "_comment": "Publisher-specific license URLs and shorthand aliases.",
  "_comment2": "URLs: normalized to https with no trailing slash on lookup.",
  "_comment3": "Aliases: cleaned-lowercase form -> version_key.",

  "urls": {
    "https://www.elsevier.com/open-access/userlicense/1.0/": {
      "version_key": "elsevier-oa",
      "name_key": "elsevier-oa",
      "family_key": "publisher-oa"
    },
    "http://www.elsevier.com/open-access/userlicense/1.0/": {
      "version_key": "elsevier-oa",
      "name_key": "elsevier-oa",
      "family_key": "publisher-oa"
    },
    "https://www.elsevier.com/tdm/userlicense/1.0/": {
      "version_key": "elsevier-tdm",
      "name_key": "elsevier-tdm",
      "family_key": "publisher-tdm"
    },
    "http://www.elsevier.com/tdm/userlicense/1.0/": {
      "version_key": "elsevier-tdm",
      "name_key": "elsevier-tdm",
      "family_key": "publisher-tdm"
    },
    "http://doi.wiley.com/10.1002/tdm_license_1": {
      "version_key": "wiley-tdm",
      "name_key": "wiley-tdm",
      "family_key": "publisher-tdm"
    },
    "http://doi.wiley.com/10.1002/tdm_license_1.1": {
      "version_key": "wiley-tdm-1.1",
      "name_key": "wiley-tdm",
      "family_key": "publisher-tdm"
    },
    "http://onlinelibrary.wiley.com/termsAndConditions#vor": {
      "version_key": "wiley-vor",
      "name_key": "wiley-vor",
      "family_key": "publisher-proprietary"
    },
    "http://onlinelibrary.wiley.com/termsAndConditions#am": {
      "version_key": "wiley-am",
      "name_key": "wiley-am",
      "family_key": "publisher-proprietary"
    },
    "https://onlinelibrary.wiley.com/termsandconditions#vor": {
      "version_key": "wiley-vor",
      "name_key": "wiley-vor",
      "family_key": "publisher-proprietary"
    },
    "https://onlinelibrary.wiley.com/termsandconditions#am": {
      "version_key": "wiley-am",
      "name_key": "wiley-am",
      "family_key": "publisher-proprietary"
    },
    "https://onlinelibrary.wiley.com/termsandconditions": {
      "version_key": "wiley-terms",
      "name_key": "wiley-terms",
      "family_key": "publisher-proprietary"
    },
    "https://onlinelibrary.wiley.com/terms-and-conditions": {
      "version_key": "wiley-terms",
      "name_key": "wiley-terms",
      "family_key": "publisher-proprietary"
    },
    "https://www.springer.com/tdm": {
      "version_key": "springer-tdm",
      "name_key": "springer-tdm",
      "family_key": "publisher-tdm"
    },
    "http://www.springer.com/tdm": {
      "version_key": "springer-tdm",
      "name_key": "springer-tdm",
      "family_key": "publisher-tdm"
    },
    "https://www.springernature.com/gp/researchers/text-and-data-mining": {
      "version_key": "springernature-tdm",
      "name_key": "springernature-tdm",
      "family_key": "publisher-tdm"
    },
    "https://www.tandfonline.com/action/showCopyRight": {
      "version_key": "tandf-terms",
      "name_key": "tandf-terms",
      "family_key": "publisher-proprietary"
    },
    "https://www.tandfonline.com/action/showcopyright": {
      "version_key": "tandf-terms",
      "name_key": "tandf-terms",
      "family_key": "publisher-proprietary"
    },
    "https://tandfonline.com/action/showcopyright": {
      "version_key": "tandf-terms",
      "name_key": "tandf-terms",
      "family_key": "publisher-proprietary"
    },
    "https://www.tandfonline.com/action/showcopyright?show=full": {
      "version_key": "tandf-terms",
      "name_key": "tandf-terms",
      "family_key": "publisher-proprietary"
    },
    "https://us.sagepub.com/en-us/nam/journals-permissions": {
      "version_key": "sage-permissions",
      "name_key": "sage-permissions",
      "family_key": "publisher-proprietary"
    },
    "https://www.sagepub.com/journalspermissions.nav": {
      "version_key": "sage-permissions",
      "name_key": "sage-permissions",
      "family_key": "publisher-proprietary"
    },
    "https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": {
      "version_key": "acs-authorchoice-ccby",
      "name_key": "acs-authorchoice-ccby",
      "family_key": "publisher-oa"
    },
    "http://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": {
      "version_key": "acs-authorchoice-ccby",
      "name_key": "acs-authorchoice-ccby",
      "family_key": "publisher-oa"
    },
    "https://pubs.acs.org/page/policy/authorchoice_ccbyncnd_termsofuse.html": {
      "version_key": "acs-authorchoice-ccbyncnd",
      "name_key": "acs-authorchoice-ccbyncnd",
      "family_key": "publisher-oa"
    },
    "https://pubs.acs.org/page/policy/authorchoice_termsofuse.html": {
      "version_key": "acs-authorchoice",
      "name_key": "acs-authorchoice",
      "family_key": "publisher-oa"
    },
    "https://pubs.acs.org/page/policy/acs_authorchoice_with_nih_addendum_termsofuse.html": {
      "version_key": "acs-authorchoice-nih",
      "name_key": "acs-authorchoice-nih",
      "family_key": "publisher-oa"
    },
    "https://doi.org/10.1021/policy/oa-license": {
      "version_key": "acs-authorchoice",
      "name_key": "acs-authorchoice",
      "family_key": "publisher-oa"
    },
    "https://www.rsc.org/journals-books-databases/journal-authors-reviewers/licences-copyright-permissions/": {
      "version_key": "rsc-terms",
      "name_key": "rsc-terms",
      "family_key": "publisher-proprietary"
    },
    "https://www.rsc.org/help/disclaimer/pages/term3.aspx": {
      "version_key": "rsc-terms",
      "name_key": "rsc-terms",
      "family_key": "publisher-proprietary"
    },
    "https://iopscience.iop.org/info/page/text-and-data-mining": {
      "version_key": "iop-tdm",
      "name_key": "iop-tdm",
      "family_key": "publisher-tdm"
    },
    "http://iopscience.iop.org/info/page/text-and-data-mining": {
      "version_key": "iop-tdm",
      "name_key": "iop-tdm",
      "family_key": "publisher-tdm"
    },
    "https://iopscience.iop.org/page/copyright": {
      "version_key": "iop-copyright",
      "name_key": "iop-copyright",
      "family_key": "publisher-proprietary"
    },
    "https://www.bmj.com/company/legal-stuff/copyright-notice/": {
      "version_key": "bmj-copyright",
      "name_key": "bmj-copyright",
      "family_key": "publisher-proprietary"
    },
    "https://group.bmj.com/group/rights-licensing/permissions": {
      "version_key": "bmj-copyright",
      "name_key": "bmj-copyright",
      "family_key": "publisher-proprietary"
    },
    "https://www.science.org/content/page/science-licenses-journal-article-reuse": {
      "version_key": "aaas-author-reuse",
      "name_key": "aaas-author-reuse",
      "family_key": "publisher-proprietary"
    },
    "https://www.sciencemag.org/about/science-licenses-journal-article-reuse": {
      "version_key": "aaas-author-reuse",
      "name_key": "aaas-author-reuse",
      "family_key": "publisher-proprietary"
    },
    "https://www.pnas.org/site/aboutpnas/licenses.xhtml": {
      "version_key": "pnas-licenses",
      "name_key": "pnas-licenses",
      "family_key": "publisher-proprietary"
    },
    "https://link.aps.org/licenses/aps-default-license": {
      "version_key": "aps-default",
      "name_key": "aps-default",
      "family_key": "publisher-proprietary"
    },
    "https://link.aps.org/licenses/aps-default-text-mining-license": {
      "version_key": "aps-tdm",
      "name_key": "aps-tdm",
      "family_key": "publisher-tdm"
    },
    "https://www.cambridge.org/core/terms": {
      "version_key": "cup-terms",
      "name_key": "cup-terms",
      "family_key": "publisher-proprietary"
    },
    "https://publishing.aip.org/authors/rights-and-permissions": {
      "version_key": "aip-rights",
      "name_key": "aip-rights",
      "family_key": "publisher-proprietary"
    },
    "http://publishing.aip.org/authors/rights-and-permissions": {
      "version_key": "aip-rights",
      "name_key": "aip-rights",
      "family_key": "publisher-proprietary"
    },
    "https://jamanetwork.com/pages/cc-by-license-permissions": {
      "version_key": "jama-cc-by",
      "name_key": "jama-cc-by",
      "family_key": "publisher-oa"
    },
    "https://www.degruyter.com/dg/page/496": {
      "version_key": "degruyter-terms",
      "name_key": "degruyter-terms",
      "family_key": "publisher-proprietary"
    },
    "https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model": {
      "version_key": "oup-chorus",
      "name_key": "oup-chorus",
      "family_key": "publisher-oa"
    },
    "https://academic.oup.com/pages/standard-publication-reuse-rights": {
      "version_key": "oup-terms",
      "name_key": "oup-terms",
      "family_key": "publisher-proprietary"
    },
    "https://www.gnu.org/licenses/gpl-2.0.html": {
      "version_key": "gpl-2.0",
      "name_key": "gpl-2",
      "family_key": "copyleft"
    },
    "https://www.gnu.org/licenses/gpl-3.0.html": {
      "version_key": "gpl-3.0",
      "name_key": "gpl-3",
      "family_key": "copyleft"
    },
    "https://www.gnu.org/licenses/agpl-3.0.html": {
      "version_key": "agpl-3.0",
      "name_key": "agpl-3",
      "family_key": "copyleft"
    },
    "https://www.gnu.org/licenses/lgpl-2.1.html": {
      "version_key": "lgpl-2.1",
      "name_key": "lgpl-2.1",
      "family_key": "copyleft"
    },
    "https://www.gnu.org/licenses/lgpl-3.0.html": {
      "version_key": "lgpl-3.0",
      "name_key": "lgpl-3",
      "family_key": "copyleft"
    },
    "https://opendatacommons.org/licenses/odbl/1-0/": {
      "version_key": "odbl",
      "name_key": "odbl",
      "family_key": "open-data"
    },
    "https://opendatacommons.org/licenses/by/1-0/": {
      "version_key": "odc-by",
      "name_key": "odc-by",
      "family_key": "open-data"
    },
    "https://opendatacommons.org/licenses/pddl/1-0/": {
      "version_key": "pddl",
      "name_key": "pddl",
      "family_key": "open-data"
    }
  },

  "shorthand_aliases": {
    "elsevier user license": "elsevier-oa",
    "elsevier tdm": "elsevier-tdm",
    "elsevier tdmu": "elsevier-tdm",
    "wiley tdm license": "wiley-tdm",
    "wiley tdm": "wiley-tdm",
    "wiley vor": "wiley-vor",
    "wiley am": "wiley-am",
    "wiley author manuscript": "wiley-am",
    "springer tdm": "springer-tdm",
    "springer nature tdm": "springernature-tdm",
    "springer nature text and data mining": "springernature-tdm",
    "tandf terms": "tandf-terms",
    "taylor and francis terms": "tandf-terms",
    "taylor francis terms": "tandf-terms",
    "sage permissions": "sage-permissions",
    "acs authorchoice": "acs-authorchoice",
    "acs author choice": "acs-authorchoice",
    "acs authorchoice cc by": "acs-authorchoice-ccby",
    "acs authorchoice cc by nc nd": "acs-authorchoice-ccbyncnd",
    "acs authorchoice nih": "acs-authorchoice-nih",
    "rsc terms": "rsc-terms",
    "rsc copyright": "rsc-terms",
    "iop tdm": "iop-tdm",
    "iop text and data mining": "iop-tdm",
    "iop copyright": "iop-copyright",
    "bmj copyright": "bmj-copyright",
    "bmj permissions": "bmj-copyright",
    "aaas author reuse": "aaas-author-reuse",
    "aaas reuse": "aaas-author-reuse",
    "science author reuse": "aaas-author-reuse",
    "pnas licenses": "pnas-licenses",
    "pnas terms": "pnas-licenses",
    "aps default": "aps-default",
    "aps tdm": "aps-tdm",
    "aps text mining": "aps-tdm",
    "aps default license": "aps-default",
    "cambridge terms": "cup-terms",
    "cup terms": "cup-terms",
    "aip rights": "aip-rights",
    "aip permissions": "aip-rights",
    "jama cc by": "jama-cc-by",
    "jama open access": "jama-cc-by",
    "degruyter terms": "degruyter-terms",
    "de gruyter terms": "degruyter-terms",
    "oup chorus": "oup-chorus",
    "oup terms": "oup-terms",
    "oup standard publication": "oup-terms",
    "thieme nlm": "thieme-nlm",
    "implied oa": "implied-oa",
    "implied open access": "implied-oa",
    "unspecified oa": "unspecified-oa",
    "publisher specific oa": "publisher-specific-oa",
    "author manuscript": "author-manuscript",
    "all rights reserved": "all-rights-reserved",
    "no reuse": "no-reuse",
    "public domain": "public-domain",
    "open access": "other-oa",
    "creative commons public domain": "cc-pdm-1.0",
    "pd": "public-domain"
  }
}

src/licence_normaliser/data/urls/url_map.json

{
  "_comment": "URL -> metadata dict. Both http and https variants may be listed.",
  "_comment2": "Normalisation (https, no trailing slash) is applied on load.",

  "https://creativecommons.org/licenses/by/4.0/": {"version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"},
  "https://creativecommons.org/licenses/by/3.0/": {"version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"},
  "https://creativecommons.org/licenses/by/2.5/": {"version_key": "cc-by-2.5", "name_key": "cc-by", "family_key": "cc"},
  "https://creativecommons.org/licenses/by/2.0/": {"version_key": "cc-by-2.0", "name_key": "cc-by", "family_key": "cc"},
  "https://creativecommons.org/licenses/by/1.0/": {"version_key": "cc-by-1.0", "name_key": "cc-by", "family_key": "cc"},
  "https://creativecommons.org/licenses/by/3.0/deed.en_us": {"version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"},

  "https://creativecommons.org/licenses/by-sa/4.0/": {"version_key": "cc-by-sa-4.0", "name_key": "cc-by-sa", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-sa/3.0/": {"version_key": "cc-by-sa-3.0", "name_key": "cc-by-sa", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-sa/2.5/": {"version_key": "cc-by-sa-2.5", "name_key": "cc-by-sa", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-sa/2.0/": {"version_key": "cc-by-sa-2.0", "name_key": "cc-by-sa", "family_key": "cc"},

  "https://creativecommons.org/licenses/by-nd/4.0/": {"version_key": "cc-by-nd-4.0", "name_key": "cc-by-nd", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nd/3.0/": {"version_key": "cc-by-nd-3.0", "name_key": "cc-by-nd", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nd/2.0/": {"version_key": "cc-by-nd-2.0", "name_key": "cc-by-nd", "family_key": "cc"},

  "https://creativecommons.org/licenses/by-nc/4.0/": {"version_key": "cc-by-nc-4.0", "name_key": "cc-by-nc", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc/3.0/": {"version_key": "cc-by-nc-3.0", "name_key": "cc-by-nc", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc/2.5/": {"version_key": "cc-by-nc-2.5", "name_key": "cc-by-nc", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc/2.0/": {"version_key": "cc-by-nc-2.0", "name_key": "cc-by-nc", "family_key": "cc"},

  "https://creativecommons.org/licenses/by-nc-sa/4.0/": {"version_key": "cc-by-nc-sa-4.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc-sa/3.0/": {"version_key": "cc-by-nc-sa-3.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc-sa/2.5/": {"version_key": "cc-by-nc-sa-2.5", "name_key": "cc-by-nc-sa", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc-sa/2.0/": {"version_key": "cc-by-nc-sa-2.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},

  "https://creativecommons.org/licenses/by-nc-nd/4.0/": {"version_key": "cc-by-nc-nd-4.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc-nd/3.0/": {"version_key": "cc-by-nc-nd-3.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc-nd/2.5/": {"version_key": "cc-by-nc-nd-2.5", "name_key": "cc-by-nc-nd", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc-nd/2.0/": {"version_key": "cc-by-nc-nd-2.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},

  "https://creativecommons.org/licenses/by/3.0/igo/": {"version_key": "cc-by-3.0-igo", "name_key": "cc-by-igo", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc-sa/3.0/igo/": {"version_key": "cc-by-nc-sa-3.0-igo", "name_key": "cc-by-nc-sa-igo", "family_key": "cc"},
  "https://creativecommons.org/licenses/by-nc-nd/3.0/igo/": {"version_key": "cc-by-nc-nd-3.0-igo", "name_key": "cc-by-nc-nd-igo", "family_key": "cc"},

  "https://creativecommons.org/publicdomain/zero/1.0/": {"version_key": "cc0", "name_key": "cc0", "family_key": "cc0"},
  "https://creativecommons.org/publicdomain/mark/1.0/": {"version_key": "cc-pdm", "name_key": "cc-pdm", "family_key": "public-domain"},

  "https://www.gnu.org/licenses/gpl-2.0.html": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"},
  "https://www.gnu.org/licenses/gpl-2.0": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"},
  "http://www.gnu.org/licenses/gpl-2.0.html": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"},
  "https://www.gnu.org/licenses/gpl-3.0.html": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"},
  "https://www.gnu.org/licenses/gpl-3.0": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"},
  "http://www.gnu.org/licenses/gpl-3.0.html": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"},
  "https://www.gnu.org/licenses/agpl-3.0.html": {"version_key": "agpl-3.0", "name_key": "agpl-3", "family_key": "copyleft"},
  "https://www.gnu.org/licenses/agpl-3.0": {"version_key": "agpl-3.0", "name_key": "agpl-3", "family_key": "copyleft"},
  "https://www.gnu.org/licenses/lgpl-2.1.html": {"version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft"},
  "https://www.gnu.org/licenses/lgpl-2.1": {"version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft"},
  "https://www.gnu.org/licenses/lgpl-3.0.html": {"version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft"},
  "https://www.gnu.org/licenses/lgpl-3.0": {"version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft"},

  "https://opensource.org/licenses/MIT": {"version_key": "mit", "name_key": "mit", "family_key": "osi"},
  "https://www.apache.org/licenses/LICENSE-2.0": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"},
  "https://www.apache.org/licenses/LICENSE-2.0.html": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"},
  "https://opensource.org/licenses/Apache-2.0": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"},
  "https://opensource.org/licenses/BSD-2-Clause": {"version_key": "bsd-2-clause", "name_key": "bsd-2-clause", "family_key": "osi"},
  "https://opensource.org/licenses/BSD-3-Clause": {"version_key": "bsd-3-clause", "name_key": "bsd-3-clause", "family_key": "osi"},
  "https://opensource.org/licenses/ISC": {"version_key": "isc", "name_key": "isc", "family_key": "osi"},
  "https://www.mozilla.org/en-US/MPL/2.0/": {"version_key": "mpl-2.0", "name_key": "mpl", "family_key": "osi"},
  "https://www.mozilla.org/MPL/2.0/": {"version_key": "mpl-2.0", "name_key": "mpl", "family_key": "osi"},

  "https://www.elsevier.com/open-access/userlicense/1.0/": {"version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"},
  "http://www.elsevier.com/open-access/userlicense/1.0/": {"version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"},
  "https://www.elsevier.com/tdm/userlicense/1.0/": {"version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"},
  "http://www.elsevier.com/tdm/userlicense/1.0/": {"version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"},

  "http://doi.wiley.com/10.1002/tdm_license_1": {"version_key": "wiley-tdm", "name_key": "wiley-tdm", "family_key": "publisher-tdm"},
  "http://doi.wiley.com/10.1002/tdm_license_1.1": {"version_key": "wiley-tdm-1.1", "name_key": "wiley-tdm", "family_key": "publisher-tdm"},
  "http://onlinelibrary.wiley.com/termsAndConditions#vor": {"version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary"},
  "http://onlinelibrary.wiley.com/termsAndConditions#am": {"version_key": "wiley-am", "name_key": "wiley-am", "family_key": "publisher-proprietary"},
  "https://onlinelibrary.wiley.com/termsandconditions#vor": {"version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary"},
  "https://onlinelibrary.wiley.com/termsandconditions#am": {"version_key": "wiley-am", "name_key": "wiley-am", "family_key": "publisher-proprietary"},
  "https://onlinelibrary.wiley.com/termsandconditions": {"version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary"},
  "https://onlinelibrary.wiley.com/terms-and-conditions": {"version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary"},

  "https://www.springer.com/tdm": {"version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"},
  "http://www.springer.com/tdm": {"version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"},
  "https://www.springernature.com/gp/researchers/text-and-data-mining": {"version_key": "springernature-tdm", "name_key": "springernature-tdm", "family_key": "publisher-tdm"},

  "https://www.tandfonline.com/action/showCopyRight": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},
  "https://www.tandfonline.com/action/showcopyright": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},
  "https://tandfonline.com/action/showcopyright": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},
  "https://www.tandfonline.com/action/showcopyright?show=full": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},

  "https://us.sagepub.com/en-us/nam/journals-permissions": {"version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary"},
  "https://www.sagepub.com/journalspermissions.nav": {"version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary"},

  "https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": {"version_key": "acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa"},
  "https://pubs.acs.org/page/policy/authorchoice_ccbyncnd_termsofuse.html": {"version_key": "acs-authorchoice-ccbyncnd", "name_key": "acs-authorchoice-ccbyncnd", "family_key": "publisher-oa"},
  "https://pubs.acs.org/page/policy/authorchoice_termsofuse.html": {"version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"},
  "https://pubs.acs.org/page/policy/acs_authorchoice_with_nih_addendum_termsofuse.html": {"version_key": "acs-authorchoice-nih", "name_key": "acs-authorchoice-nih", "family_key": "publisher-oa"},
  "https://doi.org/10.1021/policy/oa-license": {"version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"},

  "https://www.rsc.org/journals-books-databases/journal-authors-reviewers/licences-copyright-permissions/": {"version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary"},
  "https://www.rsc.org/help/disclaimer/pages/term3.aspx": {"version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary"},

  "https://iopscience.iop.org/info/page/text-and-data-mining": {"version_key": "iop-tdm", "name_key": "iop-tdm", "family_key": "publisher-tdm"},
  "https://iopscience.iop.org/page/copyright": {"version_key": "iop-copyright", "name_key": "iop-copyright", "family_key": "publisher-proprietary"},

  "https://www.bmj.com/company/legal-stuff/copyright-notice/": {"version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary"},
  "https://group.bmj.com/group/rights-licensing/permissions": {"version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary"},

  "https://www.science.org/content/page/science-licenses-journal-article-reuse": {"version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary"},
  "https://www.sciencemag.org/about/science-licenses-journal-article-reuse": {"version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary"},

  "https://www.pnas.org/site/aboutpnas/licenses.xhtml": {"version_key": "pnas-licenses", "name_key": "pnas-licenses", "family_key": "publisher-proprietary"},

  "https://link.aps.org/licenses/aps-default-license": {"version_key": "aps-default", "name_key": "aps-default", "family_key": "publisher-proprietary"},
  "https://link.aps.org/licenses/aps-default-text-mining-license": {"version_key": "aps-tdm", "name_key": "aps-tdm", "family_key": "publisher-tdm"},

  "https://www.cambridge.org/core/terms": {"version_key": "cup-terms", "name_key": "cup-terms", "family_key": "publisher-proprietary"},

  "https://publishing.aip.org/authors/rights-and-permissions": {"version_key": "aip-rights", "name_key": "aip-rights", "family_key": "publisher-proprietary"},

  "https://jamanetwork.com/pages/cc-by-license-permissions": {"version_key": "jama-cc-by", "name_key": "jama-cc-by", "family_key": "publisher-oa"},

  "https://www.degruyter.com/dg/page/496": {"version_key": "degruyter-terms", "name_key": "degruyter-terms", "family_key": "publisher-proprietary"},

  "https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model": {"version_key": "oup-chorus", "name_key": "oup-chorus", "family_key": "publisher-oa"},
  "https://academic.oup.com/pages/standard-publication-reuse-rights": {"version_key": "oup-terms", "name_key": "oup-terms", "family_key": "publisher-proprietary"},

  "https://opendatacommons.org/licenses/odbl/1-0/": {"version_key": "odbl", "name_key": "odbl", "family_key": "open-data"},
  "https://opendatacommons.org/licenses/by/1-0/": {"version_key": "odc-by", "name_key": "odc-by", "family_key": "open-data"},
  "https://opendatacommons.org/licenses/pddl/1-0/": {"version_key": "pddl", "name_key": "pddl", "family_key": "open-data"}
}
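
The ``_comment2`` note in this file says normalisation (https, no trailing slash) is applied on load. The loader itself is not shown here, so the following is only a minimal sketch of what such a step could look like. ``normalise_url`` and ``build_url_index`` are hypothetical names, and the real code may handle more cases (letter case, query strings, fragments).

```python
from __future__ import annotations


def normalise_url(url: str) -> str:
    """Hypothetical helper: force the https scheme and drop trailing slashes."""
    url = url.strip()
    if url.startswith("http://"):
        url = "https://" + url[len("http://"):]
    return url.rstrip("/")


def build_url_index(raw: dict[str, object]) -> dict[str, dict[str, str]]:
    """Skip "_comment"-style keys and index the rest by normalised URL."""
    index: dict[str, dict[str, str]] = {}
    for url, meta in raw.items():
        if url.startswith("_") or not isinstance(meta, dict):
            continue
        index[normalise_url(url)] = meta
    return index


# A two-key sample in the same shape as url_map.json.
sample = {
    "_comment": "URL -> metadata dict.",
    "http://www.elsevier.com/tdm/userlicense/1.0/": {
        "version_key": "elsevier-tdm",
        "name_key": "elsevier-tdm",
        "family_key": "publisher-tdm",
    },
}

index = build_url_index(sample)
# The query URL is normalised the same way before the lookup.
lookup = normalise_url("https://www.elsevier.com/tdm/userlicense/1.0")
print(index[lookup]["version_key"])
```

If the loader works this way, the explicit ``http://`` variants listed in the file would collapse onto their ``https://`` counterparts at load time.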

src/licence_normaliser/defaults.py

"""Default plugin configuration.

These are the plugin CLASSES (not instances) that form the sane defaults.
Pass them to LicenseNormaliser - they're instantiated lazily.
"""

from __future__ import annotations

from collections.abc import Mapping
from typing import Iterator

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"

__all__ = (
    "DEFAULT_PLUGINS",
    "DEFAULT_PLUGIN_KEYS",
    "get_all_refreshable_plugins",
)

DEFAULT_PLUGIN_KEYS = ("registry", "url", "alias", "family", "name", "prose")


def get_all_refreshable_plugins() -> list[type]:
    """Return all plugin classes that support refresh (have url set)."""
    from .parsers.creativecommons import CreativeCommonsParser
    from .parsers.opendefinition import OpenDefinitionParser
    from .parsers.osi import OSIParser
    from .parsers.scancode_licensedb import ScanCodeLicenseDBParser
    from .parsers.spdx import SPDXParser

    return [
        SPDXParser,
        OpenDefinitionParser,
        OSIParser,
        ScanCodeLicenseDBParser,
        CreativeCommonsParser,
    ]


def _load_registry_plugins() -> list[type]:
    from .parsers.creativecommons import CreativeCommonsParser
    from .parsers.opendefinition import OpenDefinitionParser
    from .parsers.osi import OSIParser
    from .parsers.scancode_licensedb import ScanCodeLicenseDBParser
    from .parsers.spdx import SPDXParser

    return [
        SPDXParser,
        OpenDefinitionParser,
        OSIParser,
        ScanCodeLicenseDBParser,
        CreativeCommonsParser,
    ]


def _load_url_plugins() -> list[type]:
    from .parsers.creativecommons import CreativeCommonsParser
    from .parsers.opendefinition import OpenDefinitionParser
    from .parsers.osi import OSIParser
    from .parsers.publisher import PublisherParser
    from .parsers.spdx import SPDXParser

    return [
        SPDXParser,
        OpenDefinitionParser,
        OSIParser,
        CreativeCommonsParser,
        PublisherParser,
    ]


def _load_alias_plugins() -> list[type]:
    from .parsers.alias import AliasParser
    from .parsers.publisher import PublisherParser

    # PublisherParser first, then AliasParser - AliasParser values take precedence
    return [PublisherParser, AliasParser]


def _load_family_plugins() -> list[type]:
    from .parsers.alias import AliasParser

    return [AliasParser]


def _load_name_plugins() -> list[type]:
    from .parsers.alias import AliasParser

    return [AliasParser]


def _load_prose_plugins() -> list[type]:
    from .parsers.prose import ProseParser

    return [ProseParser]


# Lazy-loaded bundle - functions delay imports until actually needed
class _LazyDefaults:
    """Lazy-loading container for default plugins."""

    _registry: list[type] | None = None
    _url: list[type] | None = None
    _alias: list[type] | None = None
    _family: list[type] | None = None
    _name: list[type] | None = None
    _prose: list[type] | None = None

    @property
    def registry(self) -> list[type]:
        if self._registry is None:
            self._registry = _load_registry_plugins()
        return self._registry

    @property
    def url(self) -> list[type]:
        if self._url is None:
            self._url = _load_url_plugins()
        return self._url

    @property
    def alias(self) -> list[type]:
        if self._alias is None:
            self._alias = _load_alias_plugins()
        return self._alias

    @property
    def family(self) -> list[type]:
        if self._family is None:
            self._family = _load_family_plugins()
        return self._family

    @property
    def name(self) -> list[type]:
        if self._name is None:
            self._name = _load_name_plugins()
        return self._name

    @property
    def prose(self) -> list[type]:
        if self._prose is None:
            self._prose = _load_prose_plugins()
        return self._prose


_LAZY = _LazyDefaults()


# Convenience accessors - these trigger lazy loading
def get_default_registry() -> list[type]:
    return _LAZY.registry


def get_default_url() -> list[type]:
    return _LAZY.url


def get_default_alias() -> list[type]:
    return _LAZY.alias


def get_default_family() -> list[type]:
    return _LAZY.family


def get_default_name() -> list[type]:
    return _LAZY.name


def get_default_prose() -> list[type]:
    return _LAZY.prose


class _LazyPluginsBundle:
    """Lazy-loading bundle - defers plugin loading until accessed."""

    _cache: dict[str, list[type]] = {}

    def _get_registry(self) -> list[type]:
        return get_default_registry()

    def _get_url(self) -> list[type]:
        return get_default_url()

    def _get_alias(self) -> list[type]:
        return get_default_alias()

    def _get_family(self) -> list[type]:
        return get_default_family()

    def _get_name(self) -> list[type]:
        return get_default_name()

    def _get_prose(self) -> list[type]:
        return get_default_prose()

    def __getitem__(self, key: str) -> list[type]:
        if key not in self._cache:
            fn = getattr(self, f"_get_{key}", None)
            if fn is None:
                raise KeyError(key)
            self._cache[key] = fn()
        return self._cache[key]


_DEFAULT_PLUGINS_BUNDLE = _LazyPluginsBundle()


class _DefaultPlugins(Mapping):
    """Lazy dict-like accessor for default plugins."""

    def __getitem__(self, key: str) -> list[type]:
        return _DEFAULT_PLUGINS_BUNDLE[key]

    def keys(self) -> tuple[str, ...]:
        return DEFAULT_PLUGIN_KEYS

    def values(self) -> list[list[type]]:
        return [self[k] for k in self.keys()]

    def items(self) -> list[tuple[str, list[type]]]:
        return [(k, self[k]) for k in self.keys()]

    def __iter__(self) -> Iterator[str]:
        return iter(self.keys())

    def __len__(self) -> int:
        return len(DEFAULT_PLUGIN_KEYS)

    def __contains__(self, key: str) -> bool:
        return key in self.keys()

    def copy(self) -> dict:
        return dict(self.items())


DEFAULT_PLUGINS = _DefaultPlugins()
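
The lazy-loading pattern used by ``DEFAULT_PLUGINS`` can be illustrated in isolation. This is a simplified, self-contained sketch and not the module's code: ``LazyMapping``, ``load_alias`` and ``load_prose`` are hypothetical names, and plain strings stand in for plugin classes.

```python
from __future__ import annotations

from collections.abc import Callable, Iterator, Mapping


class LazyMapping(Mapping):
    """Illustrative stand-in: values are computed on first access, then cached."""

    def __init__(self, loaders: dict[str, Callable[[], list[str]]]) -> None:
        self._loaders = loaders
        self._cache: dict[str, list[str]] = {}

    def __getitem__(self, key: str) -> list[str]:
        if key not in self._cache:
            # Unknown keys raise KeyError from the loaders dict.
            self._cache[key] = self._loaders[key]()
        return self._cache[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._loaders)

    def __len__(self) -> int:
        return len(self._loaders)

    def __contains__(self, key: object) -> bool:
        # Mapping's default __contains__ calls __getitem__, which would
        # trigger loading; checking the loader table keeps it lazy.
        return key in self._loaders


calls: list[str] = []  # records which loaders actually ran


def load_alias() -> list[str]:
    calls.append("alias")
    return ["PublisherParser", "AliasParser"]


def load_prose() -> list[str]:
    calls.append("prose")
    return ["ProseParser"]


plugins = LazyMapping({"alias": load_alias, "prose": load_prose})
first = plugins["alias"]   # runs load_alias once
second = plugins["alias"]  # served from the cache
has_prose = "prose" in plugins  # membership test does not load anything
print(calls)
```

Overriding ``__contains__`` matters because the ``Mapping`` mixin implements membership through ``__getitem__``; without the override, a plain ``in`` check would import and load the plugins.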

src/licence_normaliser/exceptions.py

"""licence_normaliser.exceptions - public exception hierarchy.

These are the only exceptions that cross the public API boundary.
All internal errors are wrapped before propagation.
"""

from __future__ import annotations

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "DataSourceError",
    "LicenseNormalisationError",
    "LicenseNormaliserError",
    "LicenseNotFoundError",
)


class LicenseNormaliserError(Exception):
    """Base exception for all licence-normaliser errors."""


class LicenseNotFoundError(LicenseNormaliserError):
    """Raised in strict mode when a license string cannot be resolved."""

    def __init__(self, raw: str, cleaned: str) -> None:
        self.raw = raw
        self.cleaned = cleaned
        super().__init__(
            f"License not found: {raw!r} (cleaned: {cleaned!r}). "
            "Pass strict=False to return an 'unknown' result instead."
        )


class DataSourceError(LicenseNormaliserError):
    """Raised when a data source file cannot be loaded or parsed."""


class LicenseNormalisationError(ValueError):
    """Raised when ``strict=True`` and no canonical license could be resolved."""

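A minimal sketch of how strict and non-strict resolution might use these exceptions. The two exception classes are re-declared locally so the snippet runs on its own. ``resolve``, ``KNOWN`` and the lower-case/whitespace cleaning step are hypothetical stand-ins, not the package's actual API.

```python
from __future__ import annotations


class LicenseNormaliserError(Exception):
    """Local stand-in for the package's base exception."""


class LicenseNotFoundError(LicenseNormaliserError):
    """Local stand-in mirroring the (raw, cleaned) constructor above."""

    def __init__(self, raw: str, cleaned: str) -> None:
        self.raw = raw
        self.cleaned = cleaned
        super().__init__(f"License not found: {raw!r} (cleaned: {cleaned!r}).")


# Hypothetical lookup table; the real package builds its tables from data files.
KNOWN = {"wiley tdm": "wiley-tdm", "public domain": "public-domain"}


def resolve(raw: str, strict: bool = False) -> str:
    """Return a version key; on a miss, raise if strict, else return "unknown"."""
    cleaned = " ".join(raw.lower().split())  # assumed cleaning step
    if cleaned in KNOWN:
        return KNOWN[cleaned]
    if strict:
        raise LicenseNotFoundError(raw, cleaned)
    return "unknown"


caught = None
try:
    resolve("  Some Custom Licence ", strict=True)
except LicenseNormaliserError as exc:  # the base class also catches the subclass
    caught = exc

print(resolve("Wiley  TDM"), resolve("nope"), type(caught).__name__)
```

Catching the base class is enough for callers that only need to know a licence-normaliser error occurred; the ``raw`` and ``cleaned`` attributes remain available for logging.
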
src/licence_normaliser/parsers/__init__.py

src/licence_normaliser/parsers/alias.py

"""Alias parser - loads aliases.json with rich metadata for aliases/family overrides.

Each entry may carry an optional ``aliases`` list of extra lookup keys that all
resolve to the same ``version_key``.  This lets data authors enumerate explicit
variants (e.g. hyphen vs space forms) without any auto-generation magic::

    "cc by-nc": {
        "version_key": "cc-by-nc",
        "name_key": "cc-by-nc",
        "family_key": "cc",
        "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"]
    }

All keys in ``aliases`` inherit the same ``version_key``, ``name_key``, and
``family_key`` as the primary entry.
"""

from __future__ import annotations

import json
from pathlib import Path
from typing import Any

from licence_normaliser.plugins import AliasPlugin, BasePlugin, FamilyPlugin, NamePlugin

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("AliasParser",)


def _iter_entries(
    data: dict[str, Any],
) -> list[tuple[str, dict[str, Any]]]:
    """Return (key, meta) pairs, expanding ``aliases`` sub-keys.

    For every primary entry that has an ``"aliases"`` list, each alias key is
    emitted as an additional entry with the same metadata dict (minus the
    ``aliases`` field itself, to keep things tidy).
    """
    results: list[tuple[str, dict[str, Any]]] = []
    for primary_key, meta in data.items():
        if primary_key.startswith("_"):
            continue
        if not isinstance(meta, dict):
            continue
        version_key = meta.get("version_key", "")
        if not version_key:
            continue
        results.append((primary_key, meta))

        # Expand explicit alias variants
        for extra_key in meta.get("aliases", []):
            if not isinstance(extra_key, str) or not extra_key:
                continue
            if extra_key == primary_key:
                continue  # already emitted
            # Slim copy without the aliases list, so expanded entries omit it
            slim_meta = {k: v for k, v in meta.items() if k != "aliases"}
            results.append((extra_key, slim_meta))

    return results


class AliasParser(BasePlugin, AliasPlugin, FamilyPlugin, NamePlugin):
    url = None
    local_path = "data/aliases/aliases.json"

    def _load_data(self) -> dict[str, Any]:
        path = Path(__file__).parent.parent / self.local_path
        return json.loads(path.read_text(encoding="utf-8"))

    def parse(self) -> list[tuple[str, dict[str, Any]]]:
        return _iter_entries(self._load_data())

    def load_aliases(self) -> dict[str, str]:
        aliases: dict[str, str] = {}
        for alias_key, meta in _iter_entries(self._load_data()):
            version_key = meta.get("version_key", "")
            if version_key:
                aliases[alias_key] = version_key
        return aliases

    def load_aliases_with_lines(
        self,
    ) -> dict[str, tuple[str, int]]:
        """Load aliases with their source line numbers.

        Extra keys from ``aliases`` lists are reported at the line of their
        primary entry (best approximation without per-alias line tracking).

        Returns:
            dict mapping alias_key -> (version_key, line_number)
        """
        path = Path(__file__).parent.parent / self.local_path
        content = path.read_text(encoding="utf-8")
        data: dict[str, Any] = json.loads(content)
        lines = content.splitlines()
        result: dict[str, tuple[str, int]] = {}

        for primary_key, meta in data.items():
            if primary_key.startswith("_"):
                continue
            if not isinstance(meta, dict):
                continue
            version_key = meta.get("version_key", "")
            if not version_key:
                continue

            # Find line of the primary key
            primary_line = 1
            for i, line in enumerate(lines, start=1):
                if f'"{primary_key}"' in line:
                    primary_line = i
                    break

            result[primary_key] = (version_key, primary_line)

            for extra_key in meta.get("aliases", []):
                if not isinstance(extra_key, str) or not extra_key:
                    continue
                if extra_key == primary_key:
                    continue
                result[extra_key] = (version_key, primary_line)

        return result

    def load_families(self) -> dict[str, str]:
        data = self._load_data()
        overrides: dict[str, str] = {}
        for meta in data.values():
            if not isinstance(meta, dict):
                continue
            vk = meta.get("version_key", "")
            fk = meta.get("family_key", "")
            if vk and fk:
                overrides[vk] = fk
        return overrides

    def load_names(self) -> dict[str, str]:
        data = self._load_data()
        names: dict[str, str] = {}
        for meta in data.values():
            if not isinstance(meta, dict):
                continue
            vk = meta.get("version_key", "")
            nk = meta.get("name_key", "")
            if vk and nk:
                names[vk] = nk
        return names
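
The alias expansion performed by ``_iter_entries`` and ``load_aliases`` above can be sketched in isolation. This is a minimal illustration, not the package's implementation: ``SAMPLE`` is hypothetical data shaped like ``aliases.json``, and ``expand_aliases`` is a made-up helper name.

```python
from __future__ import annotations

from typing import Any

# Hypothetical sample mirroring the shape of aliases.json (not real package data).
SAMPLE: dict[str, Any] = {
    "_comment": "keys starting with an underscore are skipped",
    "apache": {
        "version_key": "apache-2.0",
        "family_key": "osi",
        "aliases": ["apache 2", "apache license"],
    },
    "broken": {"family_key": "osi"},  # no version_key, so it is dropped
}


def expand_aliases(data: dict[str, Any]) -> dict[str, str]:
    """Flatten primary keys and their extra aliases to alias -> version_key."""
    flat: dict[str, str] = {}
    for primary_key, meta in data.items():
        if primary_key.startswith("_") or not isinstance(meta, dict):
            continue
        version_key = meta.get("version_key", "")
        if not version_key:
            continue
        flat[primary_key] = version_key
        # Every extra alias points at the same version key as its primary entry.
        for extra_key in meta.get("aliases", []):
            if isinstance(extra_key, str) and extra_key:
                flat[extra_key] = version_key
    return flat


ALIASES = expand_aliases(SAMPLE)
```

Each alias maps to the version key of its primary entry, so a lookup table built this way needs no second hop to resolve ``"apache 2"``.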

src/licence_normaliser/parsers/creativecommons.py

"""Creative Commons parser - scrapes creativecommons.org for multilingual deed URLs."""

from __future__ import annotations

import json
import re
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from typing import Any

from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin

CC_LICENSE_RE = re.compile(
    r"^(by|by-nc|by-nc-nd|by-nc-sa|by-nd|by-sa|"
    r"zero|pdmark|devnations|"
    r"nc|nd|sa|sampling|nc-sa|sampling\+|nc-sampling\+|nd-nc)"
    r"/([\d.]+)"
    r"(/igo)?"
    r"(/deed\.\w+)?$",
)
VERSION_RE = re.compile(r"^[\d.]+$")


def _path_to_license_key(path: str) -> str | None:
    m = CC_LICENSE_RE.match(path)
    if not m:
        return None
    lic_type, version, igo = m.group(1), m.group(2), m.group(3)

    prefix_map = {
        "by": "cc-by",
        "by-nc": "cc-by-nc",
        "by-nc-nd": "cc-by-nc-nd",
        "by-nc-sa": "cc-by-nc-sa",
        "by-nd": "cc-by-nd",
        "by-sa": "cc-by-sa",
        "zero": "cc0",
        "pdmark": "cc-pdm",
        "devnations": "cc-devnations",
        "nc": "cc-nc",
        "nd": "cc-nd",
        "sa": "cc-sa",
        "sampling": "cc-sampling",
        "nc-sa": "cc-nc-sa",
        "sampling+": "cc-sampling-plus",
        "nc-sampling+": "cc-nc-sampling-plus",
        "nd-nc": "cc-nd-nc",
    }
    prefix = prefix_map.get(lic_type)
    if not prefix:
        return None
    suffix = "igo" if igo else ""
    key = f"{prefix}-{version}" if VERSION_RE.match(version) else prefix
    if suffix:
        key = f"{key}-{suffix}"
    return key.lower()


class CCLinkParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.in_td = False
        self.current_cell = ""
        self.current_row: list[str] = []
        self.rows: list[list[str]] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "td":
            self.in_td = True
            self.current_cell = ""
        elif tag == "a" and self.in_td:
            href = dict(attrs).get("href") or ""
            if href:
                self.current_cell += " AHREF:" + href

    def handle_endtag(self, tag: str) -> None:
        if tag == "td":
            self.in_td = False
            self.current_row.append(self.current_cell.strip())
        elif tag == "tr":
            if self.current_row:
                self.rows.append(self.current_row)
            self.current_row = []

    def handle_data(self, data: str) -> None:
        if self.in_td:
            self.current_cell += data


def _fetch_html(url: str) -> str:
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=30) as response:  # noqa: S310
        return response.read().decode("utf-8")


JURISDICTION_CODES = {
    "au",
    "at",
    "be",
    "br",
    "ca",
    "ch",
    "cl",
    "cn",
    "co",
    "cz",
    "de",
    "dk",
    "ee",
    "eg",
    "es",
    "fi",
    "fr",
    "gb",
    "gr",
    "hr",
    "hu",
    "id",
    "ie",
    "il",
    "in",
    "ir",
    "is",
    "it",
    "jp",
    "kr",
    "lt",
    "lu",
    "lv",
    "ma",
    "mt",
    "mx",
    "my",
    "nl",
    "no",
    "nz",
    "pe",
    "ph",
    "pl",
    "pt",
    "ro",
    "rs",
    "ru",
    "se",
    "si",
    "sk",
    "th",
    "tr",
    "tw",
    "ua",
    "ug",
    "us",
    "za",
    "vn",
}


def _is_international(href: str) -> bool:
    parts = href.split("/")
    return not any(p in JURISDICTION_CODES for p in parts[1:])


def _extract_deeds(html: str) -> set[str]:
    parser = CCLinkParser()
    parser.feed(html)
    deeds: set[str] = set()
    for row in parser.rows:
        if not row:
            continue
        jurisdiction = row[0]
        if jurisdiction != "English":
            continue
        for cell in row[1:]:
            for part in cell.split():
                if part.startswith("AHREF:"):
                    href = part[6:]
                    if href and _is_international(href):
                        deeds.add(href)
    return deeds


def _scrape() -> list[dict[str, str]]:
    pages = [
        "https://creativecommons.org/licenses/list.en",
        "https://creativecommons.org/publicdomain/list.en",
    ]
    all_deeds: set[str] = set()
    try:
        for page_url in pages:
            html = _fetch_html(page_url)
            all_deeds |= _extract_deeds(html)
    except Exception:
        pass

    entries: list[dict[str, str]] = []
    seen_keys: set[str] = set()
    for href in sorted(all_deeds):
        lic_key = _path_to_license_key(href)
        if not lic_key:
            continue
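        url_path = href.rsplit("/deed.", 1)[0]
        # Public-domain tools (CC0, Public Domain Mark) are served under
        # /publicdomain/, not /licenses/.
        is_public_domain = lic_key.startswith(("cc0", "cc-pdm"))
        section = "publicdomain" if is_public_domain else "licenses"
        url = f"https://creativecommons.org/{section}/{url_path}/"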
        if lic_key in seen_keys:
            continue
        seen_keys.add(lic_key)
        entries.append({"license_key": lic_key, "url": url, "path": url_path})

    return entries


class CreativeCommonsParser(BasePlugin, RegistryPlugin, URLPlugin):
    id = "creativecommons"
    url = "https://creativecommons.org/licenses/list.en"
    local_path = "data/creativecommons/creativecommons.json"

    def parse(self) -> list[tuple[str, dict[str, Any]]]:
        path = Path(__file__).parent.parent / self.local_path
        if not path.exists():
            return []
        data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
        return [
            (
                entry["license_key"],
                {
                    "url": entry["url"],
                    "name": entry["license_key"],
                    "path": entry["path"],
                },
            )
            for entry in data
            if "license_key" in entry
        ]

    def load_registry(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        if not path.exists():
            return {}
        data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
        result: dict[str, str] = {}
        for entry in data:
            key = entry.get("license_key", "")
            if key:
                result[key.lower().strip()] = key.lower().strip()
        return result

    def load_urls(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        if not path.exists():
            return {}
        data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
        result: dict[str, str] = {}
        for entry in data:
            key = entry.get("license_key", "")
            if not key:
                continue
            canonical = key.lower().strip()
            raw_url = entry.get("url", "")
            if not raw_url:
                continue
            clean = raw_url.strip().lower().rstrip("/")
            if clean.startswith("http://"):
                clean = "https://" + clean[7:]
            result[clean] = canonical
        return result

    @classmethod
    def refresh(cls, force: bool = False) -> bool:
        target = Path(__file__).parent.parent / cls.local_path
        if target.exists() and not force:
            return True
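        try:
            data = _scrape()
            if not data:
                # _scrape() swallows network errors and returns an empty list;
                # treat that as a failure instead of writing an empty file.
                return False
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(
                json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
            )
            return True
        except Exception:
            return False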
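
The path-to-key mapping done by ``_path_to_license_key`` can be illustrated with a reduced table. This is a simplified sketch under stated assumptions: ``CC_PATH_RE``, ``PREFIXES`` and ``path_to_key`` are hypothetical names, only three licence types are covered, and the input is assumed to be a relative deed path such as ``by/4.0/deed.en``.

```python
from __future__ import annotations

import re

# Reduced, hypothetical version of the licence-path pattern: type/version,
# optionally followed by /igo and a /deed.<lang> suffix.
CC_PATH_RE = re.compile(r"^(by|by-sa|zero)/([\d.]+)(/igo)?(/deed\.\w+)?$")
PREFIXES = {"by": "cc-by", "by-sa": "cc-by-sa", "zero": "cc0"}


def path_to_key(path: str) -> str | None:
    """Turn a relative deed path into a lowercase licence key, or None."""
    m = CC_PATH_RE.match(path)
    if m is None:
        return None
    lic_type, version, igo = m.group(1), m.group(2), m.group(3)
    key = f"{PREFIXES[lic_type]}-{version}"
    # The IGO variant is kept as a trailing suffix on the key.
    return f"{key}-igo" if igo else key
```

For example, ``path_to_key("by/3.0/igo/deed.en")`` yields ``"cc-by-3.0-igo"``, while a path that does not fit the pattern, such as ``"by/4.0/legalcode"``, yields ``None``.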

src/licence_normaliser/parsers/opendefinition.py

"""OpenDefinition parser - loads opendefinition_licenses_all.json from package data."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Any

from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("OpenDefinitionParser",)


class OpenDefinitionParser(BasePlugin, RegistryPlugin, URLPlugin):
    id = "opendefinition"
    url = "https://licenses.opendefinition.org/licenses/groups/all.json"
    local_path = "data/opendefinition/opendefinition.json"

    def parse(self) -> list[tuple[str, dict[str, Any]]]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        results: list[tuple[str, dict[str, Any]]] = []
        for entry in data.values():
            if not isinstance(entry, dict):
                continue
            lid = entry.get("id", "")
            if not lid:
                # Skip entries without an id, consistent with load_registry().
                continue
            url = entry.get("url", "")
            results.append((lid, {"url": url, "title": entry.get("title", "")}))
        return results

    def load_registry(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        result: dict[str, str] = {}
        for entry in data.values():
            if not isinstance(entry, dict):
                continue
            lid = entry.get("id", "")
            if lid:
                result[lid.lower().strip()] = lid.lower().strip()
        return result

    def load_urls(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        result: dict[str, str] = {}
        for entry in data.values():
            if not isinstance(entry, dict):
                continue
            lid = entry.get("id", "")
            if not lid:
                continue
            canonical = lid.lower().strip()
            raw_url = entry.get("url", "")
            if not raw_url:
                continue
            clean = raw_url.strip().lower().rstrip("/")
            if clean.startswith("http://"):
                clean = "https://" + clean[7:]
            result[clean] = canonical
        return result

src/licence_normaliser/parsers/osi.py

"""OSI parser - loads osi.json from package data."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Any

from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("OSIParser",)


class OSIParser(BasePlugin, RegistryPlugin, URLPlugin):
    id = "osi"
    url = "https://opensource.org/api/license"
    local_path = "data/osi/osi.json"

    def parse(self) -> list[tuple[str, dict[str, Any]]]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        results: list[tuple[str, dict[str, Any]]] = []
        if not isinstance(data, list):
            return results
        for entry in data:
            if not isinstance(entry, dict):
                continue
            key = entry.get("id", "")
            if not key:
                continue
            links = entry.get("_links", {})
            html_link = links.get("html", {})
            url = html_link.get("href", "") if isinstance(html_link, dict) else ""
            results.append(
                (
                    key,
                    {
                        "url": url,
                        "name": entry.get("name", ""),
                        "spdx_id": entry.get("spdx_id", ""),
                    },
                )
            )
        return results

    def load_registry(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        result: dict[str, str] = {}
        if not isinstance(data, list):
            return result
        for entry in data:
            if not isinstance(entry, dict):
                continue
            key = entry.get("id", "").strip()
            if key:
                result[key.lower()] = key.lower()
        return result

    def load_urls(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        result: dict[str, str] = {}
        if not isinstance(data, list):
            return result
        for entry in data:
            if not isinstance(entry, dict):
                continue
            key = entry.get("id", "").strip()
            if not key:
                continue
            canonical = key.lower()
            links = entry.get("_links", {})
            html_link = links.get("html", {})
            raw_url = html_link.get("href", "") if isinstance(html_link, dict) else ""
            if not raw_url:
                continue
            clean = raw_url.strip().lower().rstrip("/")
            if clean.startswith("http://"):
                clean = "https://" + clean[7:]
            result[clean] = canonical
        return result

src/licence_normaliser/parsers/prose.py

"""Prose pattern parser - loads prose_patterns.json and compiles regex patterns."""

from __future__ import annotations

import json
import re
from pathlib import Path
from typing import Any

from licence_normaliser.plugins import BasePlugin, ProsePlugin

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("ProseParser",)

_COMPILED_PATTERNS: list[tuple[re.Pattern[str], str]] = []


class ProseParser(BasePlugin, ProsePlugin):
    is_registry_entry = False
    url = None
    local_path = "data/prose/prose_patterns.json"

    def parse(self) -> list[tuple[str, dict[str, Any]]]:
        path = Path(__file__).parent.parent / self.local_path
        data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
        global _COMPILED_PATTERNS
        _COMPILED_PATTERNS = []
        results: list[tuple[str, dict[str, Any]]] = []
        for entry in data:
            pattern_str = entry.get("pattern", "")
            version_key = entry.get("version_key", "")
            name_key = entry.get("name_key", "")
            family_key = entry.get("family_key", "")
            if pattern_str and version_key:
                compiled = re.compile(pattern_str, re.IGNORECASE)
                _COMPILED_PATTERNS.append((compiled, version_key))
                results.append(
                    (
                        pattern_str,
                        {
                            "pattern": compiled,
                            "version_key": version_key,
                            "name_key": name_key,
                            "family_key": family_key,
                        },
                    )
                )
        return results

    def load_prose(self) -> list[tuple[re.Pattern[str], str]]:
        global _COMPILED_PATTERNS
        _COMPILED_PATTERNS = []
        path = Path(__file__).parent.parent / self.local_path
        data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
        for entry in data:
            pattern_str = entry.get("pattern", "")
            version_key = entry.get("version_key", "")
            if pattern_str and version_key:
                compiled = re.compile(pattern_str, re.IGNORECASE)
                _COMPILED_PATTERNS.append((compiled, version_key))
        return _COMPILED_PATTERNS

    def load_prose_with_lines(self) -> list[tuple[re.Pattern[str], str, int]]:
        """Load prose patterns with their source line numbers.

        Returns:
            list of (compiled_pattern, version_key, line_number)
        """
        path = Path(__file__).parent.parent / self.local_path
        content = path.read_text(encoding="utf-8")
        data: list[dict[str, str]] = json.loads(content)
        lines = content.splitlines()
        result: list[tuple[re.Pattern[str], str, int]] = []
        for entry in data:
            pattern_str = entry.get("pattern", "")
            version_key = entry.get("version_key", "")
            if pattern_str and version_key:
                compiled = re.compile(pattern_str, re.IGNORECASE)
                serialized = json.dumps(pattern_str)
                line_num = 1
                for i, line in enumerate(lines, start=1):
                    if '"pattern"' in line and serialized[:30] in line:
                        line_num = i
                        break
                result.append((compiled, version_key, line_num))
        return result


def get_prose_patterns() -> list[tuple[re.Pattern[str], str]]:
    """Legacy helper: return the compiled prose patterns."""
    return _COMPILED_PATTERNS
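
How compiled prose patterns might be applied to free text can be sketched as follows. This is an assumption-laden illustration: the two patterns are hypothetical (not taken from ``prose_patterns.json``), and the first-match-wins ``search`` strategy in ``match_prose`` is a guess at how a consumer could use the ``(compiled_pattern, version_key)`` pairs, not the normaliser's confirmed behaviour.

```python
from __future__ import annotations

import re

# Hypothetical patterns, compiled case-insensitively as in ProseParser.
PATTERNS: list[tuple[re.Pattern[str], str]] = [
    (re.compile(r"creative commons attribution 4\.0", re.IGNORECASE), "cc-by-4.0"),
    (re.compile(r"all rights reserved", re.IGNORECASE), "all-rights-reserved"),
]


def match_prose(text: str) -> str | None:
    """Return the version key of the first pattern found in the text."""
    for pattern, version_key in PATTERNS:
        if pattern.search(text):
            return version_key
    return None
```

Because the first match wins in this sketch, more specific patterns would need to come before more general ones in the list.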

src/licence_normaliser/parsers/publisher.py

"""Publisher parser - loads publishers.json with URLs and shorthand aliases."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Any

from licence_normaliser.plugins import AliasPlugin, BasePlugin, URLPlugin

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("PublisherParser",)


class PublisherParser(BasePlugin, AliasPlugin, URLPlugin):
    url = None
    local_path = "data/publishers/publishers.json"

    def parse(self) -> list[tuple[str, dict[str, Any]]]:
        path = Path(__file__).parent.parent / self.local_path
        data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8"))
        results: list[tuple[str, dict[str, Any]]] = []
        urls: dict[str, dict[str, str]] = data.get("urls", {})
        for url, meta in urls.items():
            if isinstance(meta, dict):
                results.append((url, meta))
        return results

    def load_aliases(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8"))
        aliases: dict[str, str] = data.get("shorthand_aliases", {})
        return dict(aliases)

    def load_aliases_with_lines(self) -> dict[str, tuple[str, int]]:
        """Load shorthand aliases with their source line numbers."""
        path = Path(__file__).parent.parent / self.local_path
        content = path.read_text(encoding="utf-8")
        data: dict[str, Any] = json.loads(content)
        lines = content.splitlines()
        result: dict[str, tuple[str, int]] = {}
        for alias_key, version_key in data.get("shorthand_aliases", {}).items():
            for i, line in enumerate(lines, start=1):
                if f'"{alias_key}"' in line:
                    result[alias_key] = (version_key, i)
                    break
        return result

    def load_urls(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8"))
        result: dict[str, str] = {}
        for url, meta in data.get("urls", {}).items():
            if not isinstance(meta, dict):
                continue
            vk = meta.get("version_key", "")
            if not vk:
                continue
            clean = url.strip().lower().rstrip("/")
            if clean.startswith("http://"):
                clean = "https://" + clean[7:]
            result[clean] = vk
        return result

    def load_urls_with_lines(self) -> dict[str, tuple[str, int]]:
        """Load URLs with their source line numbers."""
        path = Path(__file__).parent.parent / self.local_path
        content = path.read_text(encoding="utf-8")
        data: dict[str, Any] = json.loads(content)
        lines = content.splitlines()
        result: dict[str, tuple[str, int]] = {}
        for url, meta in data.get("urls", {}).items():
            if not isinstance(meta, dict):
                continue
            vk = meta.get("version_key", "")
            if not vk:
                continue
            clean = url.strip().lower().rstrip("/")
            if clean.startswith("http://"):
                clean = "https://" + clean[7:]
            for i, line in enumerate(lines, start=1):
                if f'"{url}"' in line or f'"{clean}"' in line:
                    result[clean] = (vk, i)
                    break
        return result

src/licence_normaliser/parsers/scancode_licensedb.py

"""ScanCode-licensedb parser - loads scancode_licensedb.json from package data."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Any

from licence_normaliser.plugins import BasePlugin, RegistryPlugin

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("ScanCodeLicenseDBParser",)


class ScanCodeLicenseDBParser(BasePlugin, RegistryPlugin):
    id = "scancode-licensedb"
    url = "https://scancode-licensedb.aboutcode.org/index.json"
    local_path = "data/scancode_licensedb/scancode_licensedb.json"

    def parse(self) -> list[tuple[str, dict[str, Any]]]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        results: list[tuple[str, dict[str, Any]]] = []
        if not isinstance(data, list):
            return results
        for entry in data:
            if not isinstance(entry, dict):
                continue
            key = entry.get("license_key", "")
            if not key:
                continue
            if key.lower() == "unknown":
                continue
            spdx_key = entry.get("spdx_license_key")
            category = entry.get("category", "")
            results.append(
                (
                    key,
                    {
                        "url": "",
                        "name": key,
                        "category": category,
                        "spdx_license_key": spdx_key if spdx_key else "",
                    },
                )
            )
        return results

    def load_registry(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        result: dict[str, str] = {}
        if not isinstance(data, list):
            return result
        for entry in data:
            if not isinstance(entry, dict):
                continue
            key = entry.get("license_key", "")
            if key and key.lower() != "unknown":
                result[key.lower().strip()] = key.lower().strip()
        return result

src/licence_normaliser/parsers/spdx.py

"""SPDX parser - loads spdx-licenses.json from package data."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Any

from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("SPDXParser",)


class SPDXParser(BasePlugin, RegistryPlugin, URLPlugin):
    id = "spdx"
    url = "https://raw.githubusercontent.com/spdx/license-list-data/main/json/licenses.json"
    local_path = "data/spdx/spdx.json"

    def parse(self) -> list[tuple[str, dict[str, Any]]]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        results: list[tuple[str, dict[str, Any]]] = []
        for entry in data.get("licenses", []):
            if not isinstance(entry, dict):
                continue
            lid = entry.get("licenseId", "")
            if not lid:
                # Skip entries without an id, consistent with load_registry().
                continue
            urls = entry.get("seeAlso", [])
            url = urls[0] if urls else ""
            results.append((lid, {"url": url, "name": entry.get("name", "")}))
        return results

    def load_registry(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        result: dict[str, str] = {}
        for entry in data.get("licenses", []):
            if not isinstance(entry, dict):
                continue
            lid = entry.get("licenseId", "")
            if lid:
                result[lid.lower().strip()] = lid.lower().strip()
        return result

    def load_urls(self) -> dict[str, str]:
        path = Path(__file__).parent.parent / self.local_path
        data = json.loads(path.read_text(encoding="utf-8"))
        result: dict[str, str] = {}
        for entry in data.get("licenses", []):
            if not isinstance(entry, dict):
                continue
            lid = entry.get("licenseId", "")
            if not lid:
                continue
            canonical = lid.lower().strip()
            for raw_url in entry.get("seeAlso", []):
                if not raw_url:
                    continue
                clean = raw_url.strip().lower().rstrip("/")
                if clean.startswith("http://"):
                    clean = "https://" + clean[7:]
                result[clean] = canonical
        return result
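
The URL clean-up repeated in each ``load_urls`` method above (strip, lowercase, drop the trailing slash, upgrade ``http://`` to ``https://``) can be expressed as one small function. The helper name ``clean_url`` is hypothetical and does not exist in the package; this is only a sketch of the same steps.

```python
def clean_url(raw_url: str) -> str:
    """Normalise a URL the same way the parsers' load_urls() methods do."""
    clean = raw_url.strip().lower().rstrip("/")
    if clean.startswith("http://"):
        # Upgrade the scheme so http and https variants share one key.
        clean = "https://" + clean[len("http://"):]
    return clean


EXAMPLE = clean_url(" http://Opensource.org/licenses/MIT/ ")
```

Lookups against a map built this way only succeed if the incoming URL is normalised with the same steps.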

src/licence_normaliser/plugins.py

"""Simple plugin interface definitions.

Each plugin is a class exposing one or more ``load_*`` methods that return a
dict or a list of tuples.
Plugins are passed as classes (not instances) and are instantiated lazily.
"""

from __future__ import annotations

import json
import logging
import re
import urllib.error
import urllib.request
from pathlib import Path

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"

__all__ = (
    "AliasPlugin",
    "BasePlugin",
    "FamilyPlugin",
    "NamePlugin",
    "ProsePlugin",
    "RegistryPlugin",
    "URLPlugin",
)


class BasePlugin:
    """Base class for all plugins with refresh capability."""

    url: str | None = None
    local_path: str = ""

    @classmethod
    def refresh(cls, force: bool = False) -> bool:
        """Fetch fresh data from ``cls.url`` and write to ``cls.local_path``.

        The local path is resolved relative to the package root
        (``src/licence_normaliser/``).

        If ``cls.url`` is None, this is a local-only parser with no external
        source and the operation succeeds without fetching.

        Returns True on success, False on failure.
        """
        if not cls.local_path:
            return False
        target = Path(__file__).parent / cls.local_path
        if target.exists() and not force:
            return True
        if cls.url is None:
            return True
        try:
            with urllib.request.urlopen(cls.url, timeout=30) as response:  # noqa: S310
                raw_bytes = response.read()
            json.loads(raw_bytes.decode("utf-8"))
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(raw_bytes)
            return True
        except urllib.error.HTTPError as exc:
            # HTTPError is a subclass of URLError, so it must be caught first.
            logging.warning(
                "refresh(%s): HTTPError %s fetching %s", cls.__name__, exc.code, cls.url
            )
            return False
        except urllib.error.URLError as exc:
            logging.warning(
                "refresh(%s): URLError fetching %s - %s", cls.__name__, cls.url, exc
            )
            return False
        except json.JSONDecodeError as exc:
            logging.error(
                "refresh(%s): invalid JSON from %s - %s", cls.__name__, cls.url, exc
            )
            return False
        except OSError as exc:
            logging.error(
                "refresh(%s): OSError writing %s - %s", cls.__name__, target, exc
            )
            return False


class RegistryPlugin:
    """Returns key -> canonical_key mappings."""

    def load_registry(self) -> dict[str, str]:
        raise NotImplementedError


class URLPlugin:
    """Returns cleaned_url -> version_key mappings."""

    def load_urls(self) -> dict[str, str]:
        raise NotImplementedError


class AliasPlugin:
    """Returns alias_string -> version_key mappings."""

    def load_aliases(self) -> dict[str, str]:
        raise NotImplementedError


class FamilyPlugin:
    """Returns version_key -> family_key mappings."""

    def load_families(self) -> dict[str, str]:
        raise NotImplementedError


class NamePlugin:
    """Returns version_key -> name_key mappings."""

    def load_names(self) -> dict[str, str]:
        raise NotImplementedError


class ProsePlugin:
    """Returns list of (compiled_pattern, version_key) for prose matching."""

    def load_prose(self) -> list[tuple[re.Pattern[str], str]]:
        raise NotImplementedError
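
A custom plugin following these interfaces could look roughly like the sketch below. To keep it runnable on its own, ``AliasPlugin`` here is a local stand-in with the same method name as the class above. ``InHouseAliases`` and ``collect_aliases`` are hypothetical, and the merge strategy (later plugins override earlier ones) is an assumption, not the package's documented behaviour.

```python
from __future__ import annotations


class AliasPlugin:
    """Local stand-in mirroring the AliasPlugin interface."""

    def load_aliases(self) -> dict[str, str]:
        raise NotImplementedError


class InHouseAliases(AliasPlugin):
    """Hypothetical plugin contributing organisation-specific aliases."""

    def load_aliases(self) -> dict[str, str]:
        return {"acme open licence": "mit"}


def collect_aliases(plugin_classes: list[type[AliasPlugin]]) -> dict[str, str]:
    """Instantiate each plugin class only when its data is needed, then merge."""
    merged: dict[str, str] = {}
    for plugin_cls in plugin_classes:
        merged.update(plugin_cls().load_aliases())
    return merged


MERGED = collect_aliases([InHouseAliases])
```

Passing classes rather than instances means no plugin reads its data file until the collecting code actually calls it.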

src/licence_normaliser/tests/__init__.py

"""Tests for licence_normaliser."""

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"

src/licence_normaliser/tests/conftest.py

"""Shared fixtures for licence_normaliser tests."""

import pytest

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


@pytest.fixture()
def mit_raw() -> str:
    return "MIT"


@pytest.fixture()
def cc_by_nc_nd_4_raw() -> str:
    return "CC BY-NC-ND 4.0"


@pytest.fixture()
def batch_raw() -> list[str]:
    return ["MIT", "Apache-2.0", "CC BY 4.0"]

src/licence_normaliser/tests/test_aliases.py

"""Tests for AliasParser - non-CC aliases (Apache, MIT, BSD, GPL, etc.)."""

from licence_normaliser import normalise_license


class TestNonCCAliases:
    def test_apache_shorthand(self):
        v = normalise_license("apache")
        assert v.key == "apache-2.0"
        assert v.family.key == "osi"

    def test_apache_license(self):
        v = normalise_license("apache license")
        assert v.key == "apache-2.0"
        assert v.family.key == "osi"

    def test_apache_2(self):
        v = normalise_license("apache 2")
        assert v.key == "apache-2.0"
        assert v.family.key == "osi"

    def test_apache_2_0(self):
        v = normalise_license("apache 2.0")
        assert v.key == "apache-2.0"
        assert v.family.key == "osi"

    def test_mit_license(self):
        v = normalise_license("mit license")
        assert v.key == "mit"
        assert v.family.key == "osi"

    def test_the_mit_license(self):
        v = normalise_license("the mit license")
        assert v.key == "mit"
        assert v.family.key == "osi"

    def test_bsd_shorthand(self):
        v = normalise_license("bsd")
        assert v.key == "bsd-3-clause"
        assert v.family.key == "osi"

    def test_bsd_license(self):
        v = normalise_license("bsd license")
        assert v.key == "bsd-3-clause"
        assert v.family.key == "osi"

    def test_mozilla(self):
        v = normalise_license("mozilla")
        assert v.key == "mpl-2.0"
        assert v.family.key == "osi"

    def test_isc_license(self):
        v = normalise_license("isc license")
        assert v.key == "isc"
        assert v.family.key == "osi"

    def test_gpl_shorthand(self):
        v = normalise_license("gpl")
        assert v.key == "gpl-3.0"
        assert v.family.key == "copyleft"

    def test_gnu_gpl(self):
        v = normalise_license("gnu gpl")
        assert v.key == "gpl-3.0"
        assert v.family.key == "copyleft"

    def test_gnu_gpl_v2(self):
        v = normalise_license("gnu gpl v2")
        assert v.key == "gpl-2.0"
        assert v.family.key == "copyleft"

    def test_gpl_3_0_or_later(self):
        v = normalise_license("gpl-3.0+")
        assert v.key == "gpl-3.0"
        assert v.family.key == "copyleft"

    def test_gpl_2_0_or_later(self):
        v = normalise_license("gpl-2.0+")
        assert v.key == "gpl-2.0"
        assert v.family.key == "copyleft"

    def test_agpl_shorthand(self):
        v = normalise_license("agpl")
        assert v.key == "agpl-3.0"
        assert v.family.key == "copyleft"

    def test_agpl_3_0_or_later(self):
        v = normalise_license("agpl-3.0+")
        assert v.key == "agpl-3.0"
        assert v.family.key == "copyleft"

    def test_lgpl_shorthand(self):
        v = normalise_license("lgpl")
        assert v.key == "lgpl-3.0"
        assert v.family.key == "copyleft"

    def test_lgpl_2_1_or_later(self):
        v = normalise_license("lgpl-2.1+")
        assert v.key == "lgpl-2.1"
        assert v.family.key == "copyleft"

    def test_lgpl_3_0_or_later(self):
        v = normalise_license("lgpl-3.0+")
        assert v.key == "lgpl-3.0"
        assert v.family.key == "copyleft"

    def test_unlicense(self):
        v = normalise_license("unlicense")
        assert v.key == "unlicense"
        assert v.family.key == "osi"

    def test_wtfpl(self):
        v = normalise_license("wtfpl")
        assert v.key == "wtfpl"
        assert v.family.key == "osi"

    def test_zlib(self):
        v = normalise_license("zlib")
        assert v.key == "zlib"
        assert v.family.key == "osi"

    def test_open_database_license(self):
        v = normalise_license("open database license")
        assert v.key == "odbl"
        assert v.family.key == "open-data"

    def test_public_domain(self):
        v = normalise_license("public domain")
        assert v.key == "public-domain"
        assert v.family.key == "public-domain"

    def test_pd_alias(self):
        v = normalise_license("pd")
        assert v.key == "public-domain"
        assert v.family.key == "public-domain"

src/licence_normaliser/tests/test_cache.py

"""Tests for _cache.py - thread-safe default normaliser singleton."""

from __future__ import annotations

import threading
from concurrent.futures import ThreadPoolExecutor

from licence_normaliser._cache import (
    _DefaultNormaliser,
    get_registry_keys,
    normalise_license,
    normalise_licenses,
)
from licence_normaliser._normaliser import LicenseNormaliser


class TestDefaultNormaliserSingleton:
    def test_singleton_instance_reused(self) -> None:
        d1 = _DefaultNormaliser()
        d2 = _DefaultNormaliser()
        assert d1.get() is d2.get()

    def test_get_returns_licence_normaliser(self) -> None:
        d = _DefaultNormaliser()
        instance = d.get()
        assert isinstance(instance, LicenseNormaliser)

    def test_thread_safety_same_instance(self) -> None:
        results: list[object | None] = [None] * 20
        errors: list[BaseException | None] = [None] * 20

        def get_instance(idx: int) -> None:
            try:
                d = _DefaultNormaliser()
                results[idx] = d.get()
            except BaseException as e:  # noqa: BLE001
                errors[idx] = e

        threads = [threading.Thread(target=get_instance, args=(i,)) for i in range(20)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        assert all(e is None for e in errors)
        assert results[0] is not None
        assert all(r is results[0] for r in results if r is not None)

    def test_concurrent_normalise_license(self) -> None:
        licenses = ["MIT", "Apache-2.0", "CC BY 4.0", "GPL-3.0", "BSD-3-Clause"]

        def normalise(lic: str) -> str:
            v = normalise_license(lic)
            return v.key

        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = [executor.submit(normalise, lic) for lic in licenses * 4]
            results = [f.result(timeout=5) for f in futures]

        assert len(results) == len(licenses) * 4
        assert set(results) == {
            "mit",
            "apache-2.0",
            "cc-by-4.0",
            "gpl-3.0",
            "bsd-3-clause",
        }


class TestModuleLevelAPI:
    def test_normalise_license_returns_license_version(self) -> None:
        v = normalise_license("MIT")
        assert str(v) == "mit"

    def test_normalise_licenses_returns_list(self) -> None:
        results = normalise_licenses(["MIT", "Apache-2.0"])
        assert len(results) == 2
        assert all(str(r) in ("mit", "apache-2.0") for r in results)

    def test_get_registry_keys_returns_set_of_strings(self) -> None:
        keys = get_registry_keys()
        assert isinstance(keys, set)
        assert len(keys) > 1000
        assert "mit" in keys
        assert "apache-2.0" in keys
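
The singleton behaviour exercised by the cache tests above (every wrapper returns the same underlying instance, even under concurrent access) is commonly achieved with double-checked locking. The following is a minimal sketch of that pattern, not the contents of ``_cache.py``: ``LazyDefault`` and its ``factory`` argument are hypothetical names chosen for this illustration.

```python
import threading


class LazyDefault:
    """Hypothetical wrapper that builds one shared instance on first use."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, factory):
        # The factory is only called once, by whichever thread gets there first.
        self._factory = factory

    def get(self):
        cls = type(self)
        # Fast path: no locking once the instance exists.
        if cls._instance is None:
            with cls._lock:
                # Re-check inside the lock: another thread may have won the race.
                if cls._instance is None:
                    cls._instance = self._factory()
        return cls._instance


results = []
threads = [
    threading.Thread(target=lambda: results.append(LazyDefault(object).get()))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the threads finish, every entry in ``results`` refers to the same object, which is the property ``test_thread_safety_same_instance`` checks for the real class.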

src/licence_normaliser/tests/test_cli.py

"""Tests for licence_normaliser CLI - includes new --strict flag."""

from unittest.mock import patch

import pytest

from licence_normaliser.cli._main import main

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


class TestNormaliseCommand:
    def test_normalise_mit(self, capsys):
        with patch("sys.argv", ["licence-normaliser", "normalise", "MIT"]):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0
        assert capsys.readouterr().out.strip() == "mit"

    def test_normalise_full(self, capsys):
        with patch(
            "sys.argv", ["licence-normaliser", "normalise", "--full", "CC BY 4.0"]
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0
        out = capsys.readouterr().out
        assert "Key: cc-by-4.0" in out
        assert "License: cc-by" in out
        assert "Family: cc" in out

    def test_normalise_cc_url(self, capsys):
        with patch(
            "sys.argv",
            [
                "licence-normaliser",
                "normalise",
                "http://creativecommons.org/licenses/by/4.0/",
            ],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0
        assert capsys.readouterr().out.strip() == "cc-by-4.0"

    def test_normalise_unknown(self, capsys):
        with patch(
            "sys.argv", ["licence-normaliser", "normalise", "totally-unknown-xyz"]
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0
        assert "totally-unknown-xyz" in capsys.readouterr().out

    def test_normalise_strict_known(self, capsys):
        with patch("sys.argv", ["licence-normaliser", "normalise", "--strict", "MIT"]):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0
        assert capsys.readouterr().out.strip() == "mit"

    def test_normalise_strict_unknown_exits_1(self, capsys):
        with patch(
            "sys.argv",
            ["licence-normaliser", "normalise", "--strict", "totally-unknown-xyz-9999"],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 1
        assert capsys.readouterr().err  # error message on stderr


class TestBatchCommand:
    def test_batch_basic(self, capsys):
        with patch(
            "sys.argv",
            ["licence-normaliser", "batch", "MIT", "Apache-2.0", "CC BY 4.0"],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0
        out = capsys.readouterr().out
        assert "MIT: mit" in out
        assert "Apache-2.0: apache-2.0" in out
        assert "CC BY 4.0: cc-by-4.0" in out

    def test_batch_strict_all_known(self, capsys):
        with patch(
            "sys.argv", ["licence-normaliser", "batch", "--strict", "MIT", "GPL-3.0"]
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0

    def test_batch_strict_with_unknown_exits_1(self, capsys):
        with patch(
            "sys.argv",
            ["licence-normaliser", "batch", "--strict", "MIT", "no-such-license-xyz"],
        ):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 1


class TestVersionFlag:
    def test_version_flag(self, capsys):
        with patch("sys.argv", ["licence-normaliser", "--version"]):
            with pytest.raises(SystemExit) as exc_info:
                main()
            assert exc_info.value.code == 0
        assert "licence-normaliser" in capsys.readouterr().out

src/licence_normaliser/tests/test_core.py

"""End-to-end pipeline tests via the public API."""

from licence_normaliser import normalise_license, normalise_licenses

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


class TestDirectLookup:
    def test_mit(self):
        v = normalise_license("mit")
        assert v.key == "mit"
        assert v.family.key == "osi"

    def test_apache(self):
        v = normalise_license("apache-2.0")
        assert v.key == "apache-2.0"
        assert v.family.key == "osi"

    def test_cc_by_4_0(self):
        v = normalise_license("cc-by-4.0")
        assert v.key == "cc-by-4.0"
        assert v.family.key == "cc"

    def test_cc_by_nc_nd_4_0(self):
        v = normalise_license("cc-by-nc-nd-4.0")
        assert v.key == "cc-by-nc-nd-4.0"
        assert v.family.key == "cc"

    def test_cc0_1_0(self):
        v = normalise_license("cc0-1.0")
        assert v.key == "cc0-1.0"
        assert v.family.key == "cc0"

    def test_gpl_3_0(self):
        v = normalise_license("gpl-3.0")
        assert v.key == "gpl-3.0"
        assert v.family.key == "copyleft"

    def test_gpl_2_0_only(self):
        v = normalise_license("gpl-2.0-only")
        assert v.key == "gpl-2.0-only"
        assert v.family.key == "copyleft"

    def test_lgpl_2_1(self):
        v = normalise_license("lgpl-2.1")
        assert v.key == "lgpl-2.1"
        assert v.family.key == "copyleft"

    def test_agpl_3_0(self):
        v = normalise_license("agpl-3.0")
        assert v.key == "agpl-3.0"
        assert v.family.key == "copyleft"

    def test_bsd_3_clause(self):
        v = normalise_license("bsd-3-clause")
        assert v.key == "bsd-3-clause"
        assert v.family.key == "osi"

    def test_isc(self):
        v = normalise_license("isc")
        assert v.key == "isc"
        assert v.family.key == "osi"

    def test_mpl_2_0(self):
        v = normalise_license("mpl-2.0")
        assert v.key == "mpl-2.0"
        assert v.family.key == "osi"

    def test_unlicense(self):
        v = normalise_license("unlicense")
        assert v.key == "unlicense"
        assert v.family.key == "osi"

    def test_wtfpl(self):
        v = normalise_license("wtfpl")
        assert v.key == "wtfpl"
        assert v.family.key == "osi"

    def test_zlib(self):
        v = normalise_license("zlib")
        assert v.key == "zlib"
        assert v.family.key == "osi"

    def test_odbl_1_0(self):
        v = normalise_license("odbl-1.0")
        assert v.key == "odbl-1.0"
        assert v.family.key == "open-data"

    def test_pddl_1_0(self):
        v = normalise_license("pddl-1.0")
        assert v.key == "pddl-1.0"
        assert v.family.key == "data"

    def test_odc_by_1_0(self):
        v = normalise_license("odc-by-1.0")
        assert v.key == "odc-by-1.0"
        assert v.family.key == "open-data"

    def test_unknown(self):
        v = normalise_license("unknown")
        assert v.key == "unknown"
        assert v.family.key == "unknown"

    def test_case_insensitive(self):
        v = normalise_license("MIT")
        assert v.key == "mit"
        v = normalise_license("Apache-2.0")
        assert v.key == "apache-2.0"


class TestBuiltinAliases:
    def test_cc_by(self):
        assert normalise_license("CC BY").key == "cc-by"

    def test_cc_by_4_0(self):
        assert normalise_license("CC BY 4.0").key == "cc-by-4.0"

    def test_cc_by_nc_nd_4_0(self):
        assert normalise_license("CC BY-NC-ND 4.0").key == "cc-by-nc-nd-4.0"

    def test_cc_by_nc_sa_4_0(self):
        assert normalise_license("CC BY-NC-SA 4.0").key == "cc-by-nc-sa-4.0"

    def test_cc0_1_0(self):
        assert normalise_license("CC0 1.0").key == "cc0-1.0"

    def test_public_domain(self):
        assert normalise_license("public domain").key == "public-domain"


class TestUrlLookup:
    def test_cc_by_https(self):
        v = normalise_license("https://creativecommons.org/licenses/by/4.0/")
        assert v.key == "cc-by-4.0"

    def test_cc_by_http(self):
        v = normalise_license("http://creativecommons.org/licenses/by/4.0/")
        assert v.key == "cc-by-4.0"

    def test_cc_by_no_trailing_slash(self):
        v = normalise_license("https://creativecommons.org/licenses/by/4.0")
        assert v.key == "cc-by-4.0"

    def test_mit_url(self):
        v = normalise_license("https://opensource.org/licenses/MIT")
        assert v.key == "mit"


class TestFamilyInference:
    def test_cc_family(self):
        v = normalise_license("cc-by-4.0")
        assert v.family.key == "cc"

    def test_cc0_family(self):
        v = normalise_license("cc0-1.0")
        assert v.family.key == "cc0"

    def test_copyleft_family(self):
        assert normalise_license("gpl-3.0").family.key == "copyleft"
        assert normalise_license("agpl-3.0").family.key == "copyleft"
        assert normalise_license("lgpl-2.1").family.key == "copyleft"

    def test_osi_family(self):
        assert normalise_license("mit").family.key == "osi"
        assert normalise_license("apache-2.0").family.key == "osi"
        assert normalise_license("bsd-3-clause").family.key == "osi"

    def test_data_family(self):
        assert normalise_license("pddl-1.0").family.key == "data"


class TestNameInference:
    def test_cc_name_strips_version(self):
        assert normalise_license("cc-by-4.0").license.key == "cc-by"
        assert normalise_license("cc-by-nc-nd-4.0").license.key == "cc-by-nc-nd"
        assert normalise_license("cc-by-sa-3.0").license.key == "cc-by-sa"
        assert normalise_license("cc0-1.0").license.key == "cc0"
        assert normalise_license("cc-by-nc-sa-4.0").license.key == "cc-by-nc-sa"

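    def test_non_cc_keeps_key(self):
        # Unversioned keys such as "mit" are kept unchanged.
        assert normalise_license("mit").license.key == "mit"
        # Versioned GPL keys map to a major-version name ("gpl-3.0" -> "gpl-3").
        assert normalise_license("gpl-3.0").license.key == "gpl-3"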


class TestHierarchyNavigation:
    def test_version_license_family_chain(self):
        v = normalise_license("CC BY-NC-ND 4.0")
        assert v.key == "cc-by-nc-nd-4.0"
        assert v.license.key == "cc-by-nc-nd"
        assert v.license.family.key == "cc"
        assert v.family.key == "cc"

    def test_str_representations(self):
        v = normalise_license("CC BY-NC-ND 4.0")
        assert str(v) == "cc-by-nc-nd-4.0"
        assert str(v.license) == "cc-by-nc-nd"
        assert str(v.family) == "cc"


class TestFallback:
    def test_unknown_string(self):
        v = normalise_license("some-totally-unknown-license-xyz")
        assert v.key == "some-totally-unknown-license-xyz"
        assert v.family.key == "unknown"

    def test_empty_string(self):
        v = normalise_license("")
        assert v.key == "unknown"

    def test_whitespace_only(self):
        v = normalise_license("   ")
        assert v.key == "unknown"


class TestBatchNormalisation:
    def test_basic_batch(self):
        results = normalise_licenses(["MIT", "Apache-2.0", "CC BY 4.0"])
        assert [r.key for r in results] == ["mit", "apache-2.0", "cc-by-4.0"]

    def test_batch_preserves_order(self):
        raw = ["GPL-3.0", "MIT", "CC BY 4.0", "Apache-2.0"]
        expected = ["gpl-3.0", "mit", "cc-by-4.0", "apache-2.0"]
        assert [r.key for r in normalise_licenses(raw)] == expected

    def test_batch_accepts_generator(self):
        results = normalise_licenses(x for x in ["MIT", "ISC"])
        assert results[0].key == "mit"

    def test_batch_empty(self):
        assert normalise_licenses([]) == []

src/licence_normaliser/tests/test_exceptions.py

"""Tests for strict mode and the public exception hierarchy."""

import pytest

from licence_normaliser import normalise_license, normalise_licenses
from licence_normaliser.exceptions import (
    LicenseNormaliserError,
    LicenseNotFoundError,
)

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


class TestLicenseNotFoundError:
    def test_is_subclass_of_base(self):
        assert issubclass(LicenseNotFoundError, LicenseNormaliserError)

    def test_is_subclass_of_exception(self):
        assert issubclass(LicenseNotFoundError, Exception)

    def test_attributes(self):
        exc = LicenseNotFoundError("My License", "my license")
        assert exc.raw == "My License"
        assert exc.cleaned == "my license"

    def test_str_contains_raw(self):
        exc = LicenseNotFoundError("My License", "my license")
        assert "My License" in str(exc)

    def test_str_mentions_strict_false(self):
        exc = LicenseNotFoundError("x", "x")
        assert "strict=False" in str(exc)


class TestStrictModeNormalise:
    def test_known_license_no_raise(self):
        # Known licenses must not raise in strict mode
        v = normalise_license("MIT", strict=True)
        assert v.key == "mit"

    def test_unknown_raises_license_not_found(self):
        with pytest.raises(LicenseNotFoundError) as exc_info:
            normalise_license("totally-unknown-xyz-9999", strict=True)
        assert exc_info.value.raw == "totally-unknown-xyz-9999"
        assert exc_info.value.cleaned == "totally-unknown-xyz-9999"

    def test_empty_string_raises(self):
        with pytest.raises(LicenseNotFoundError):
            normalise_license("", strict=True)

    def test_whitespace_only_raises(self):
        with pytest.raises(LicenseNotFoundError):
            normalise_license("   ", strict=True)

    def test_cc_url_known_no_raise(self):
        v = normalise_license(
            "https://creativecommons.org/licenses/by/4.0/", strict=True
        )
        assert v.key == "cc-by-4.0"

    def test_strict_false_unknown_returns_unknown(self):
        # Default (strict=False): silently returns unknown
        v = normalise_license("no-such-license-xyzzy", strict=False)
        assert v.family.key == "unknown"

    def test_strict_default_is_false(self):
        # Calling without strict kwarg should not raise
        v = normalise_license("no-such-license-xyzzy")
        assert v.family.key == "unknown"


class TestStrictModeBatch:
    def test_all_known_no_raise(self):
        results = normalise_licenses(["MIT", "Apache-2.0"], strict=True)
        assert len(results) == 2
        assert results[0].key == "mit"
        assert results[1].key == "apache-2.0"

    def test_one_unknown_raises(self):
        with pytest.raises(LicenseNotFoundError):
            normalise_licenses(["MIT", "no-such-license-xyz"], strict=True)

    def test_non_strict_batch_with_unknown(self):
        results = normalise_licenses(["MIT", "no-such-license-xyz"], strict=False)
        assert results[0].key == "mit"
        assert results[1].family.key == "unknown"

    def test_empty_batch_strict(self):
        # Empty input should not raise even in strict mode
        assert normalise_licenses([], strict=True) == []
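
The strict-mode contract checked above can be summarised in a small stand-alone sketch. It assumes only what the tests state: known inputs resolve, unknown inputs raise in strict mode, and non-strict mode falls back to the cleaned string (or ``"unknown"`` for empty input). ``ExampleNotFoundError``, ``KNOWN_KEYS`` and ``resolve_key`` are hypothetical names for this illustration and are not part of the library.

```python
class ExampleNotFoundError(Exception):
    """Stand-in exception carrying the raw and cleaned input."""

    def __init__(self, raw: str, cleaned: str) -> None:
        self.raw = raw
        self.cleaned = cleaned
        super().__init__(
            f"License not found: {raw!r} (cleaned: {cleaned!r}). "
            "Pass strict=False to fall back to an unknown result."
        )


# Hypothetical, deliberately tiny registry used only in this sketch.
KNOWN_KEYS = {"mit", "apache-2.0"}


def resolve_key(raw: str, strict: bool = False) -> str:
    """Return a key for raw, raising in strict mode when nothing matches."""
    cleaned = raw.strip().lower()
    if cleaned in KNOWN_KEYS:
        return cleaned
    if strict:
        raise ExampleNotFoundError(raw, cleaned)
    # Non-strict fallback: keep the cleaned text, or "unknown" if it is empty.
    return cleaned or "unknown"


print(resolve_key("MIT"))
print(resolve_key("   "))
```

Keeping both ``raw`` and ``cleaned`` on the exception lets callers report the original input while still seeing what the lookup was actually performed on.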

src/licence_normaliser/tests/test_integration.py

"""Comprehensive integration tests covering the full license matrix.

Each tuple: (input_string, expected_version_key, expected_license_key,
             expected_family_key)
"""

import pytest

from licence_normaliser import (
    LicenseNormalisationError,
    LicenseNotFoundError,
    LicenseVersion,
    normalise_license,
    normalise_licenses,
)

LICENSE_MATRIX = [
    # raw, expected_key, expected_license, expected_family
    # === OSI-approved licenses ===
    ("mit", "mit", "mit", "osi"),
    ("MIT", "mit", "mit", "osi"),
    ("  mit  ", "mit", "mit", "osi"),
    ("apache-2.0", "apache-2.0", "apache", "osi"),
    ("Apache-2.0", "apache-2.0", "apache", "osi"),
    ("Apache 2.0", "apache-2.0", "apache", "osi"),
    ("Apache License 2.0", "apache-2.0", "apache", "osi"),
    (
        "BSD 3-Clause",
        "bsd-3-clause",
        "bsd-3-clause",
        "osi",
    ),  # Resolves to bsd-3-clause/osi, matches SPDX and alias entries
    ("bsd-3-clause", "bsd-3-clause", "bsd-3-clause", "osi"),
    ("BSD License", "bsd-3-clause", "bsd-3-clause", "osi"),
    ("MPL-2.0", "mpl-2.0", "mpl", "osi"),
    ("mpl-2.0", "mpl-2.0", "mpl", "osi"),
    (
        "Mozilla Public License 2.0",
        "mpl-2.0",
        "mpl",
        "osi",
    ),  # Canonical full name of MPL-2.0, matches alias entry
    ("ISC", "isc", "isc", "osi"),
    ("isc", "isc", "isc", "osi"),
    ("ISC License", "isc", "isc", "osi"),
    ("Unlicense", "unlicense", "unlicense", "osi"),
    ("unlicense", "unlicense", "unlicense", "osi"),
    ("WTFPL", "wtfpl", "wtfpl", "osi"),
    ("wtfpl", "wtfpl", "wtfpl", "osi"),
    ("Zlib", "zlib", "zlib", "osi"),
    ("zlib", "zlib", "zlib", "osi"),
    # === GPL / AGPL / LGPL (copyleft) ===
    ("gpl-3.0", "gpl-3.0", "gpl-3", "copyleft"),
    ("GPL-3.0", "gpl-3.0", "gpl-3", "copyleft"),
    ("gpl-3.0+", "gpl-3.0", "gpl-3", "copyleft"),
    (
        "gpl-3-0",
        "gpl-3-0",
        "gpl-3-0",
        "copyleft",
    ),  # NOTE: hyphen instead of dot; resolved to the copyleft family, but the key is not normalised to gpl-3.0
    ("GNU GPL v3", "gpl-3.0", "gpl-3", "copyleft"),
    ("GPL v3", "gpl-3.0", "gpl-3", "copyleft"),
    ("gpl-2.0", "gpl-2.0", "gpl-2", "copyleft"),
    ("GPL v2", "gpl-2.0", "gpl-2", "copyleft"),
    ("lgpl-3.0", "lgpl-3.0", "lgpl-3", "copyleft"),
    ("LGPL-3.0", "lgpl-3.0", "lgpl-3", "copyleft"),
    ("lgpl-2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"),
    ("LGPL v2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"),
    ("lgpl v2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"),
    ("agpl-3.0", "agpl-3.0", "agpl-3", "copyleft"),
    ("AGPL v3", "agpl-3.0", "agpl-3", "copyleft"),
    # === Creative Commons ===
    ("CC BY 4.0", "cc-by-4.0", "cc-by", "cc"),
    ("cc by 4.0", "cc-by-4.0", "cc-by", "cc"),
    ("cc-by-4.0", "cc-by-4.0", "cc-by", "cc"),
    ("CC BY 3.0", "cc-by-3.0", "cc-by", "cc"),
    ("cc by 3.0", "cc-by-3.0", "cc-by", "cc"),
    ("cc-by-3.0", "cc-by-3.0", "cc-by", "cc"),
    ("CC BY 2.5", "cc-by-2.5", "cc-by", "cc"),
    ("CC BY 2.0", "cc-by-2.0", "cc-by", "cc"),
    ("CC BY 1.0", "cc-by-1.0", "cc-by", "cc"),
    ("cc by", "cc-by", "cc-by", "cc"),
    (
        "CC-BY",
        "cc-by",
        "cc-by",
        "cc",
    ),  # SPDX form, resolves to cc-by/cc
    ("CC BY-NC 4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"),
    ("cc by-nc 4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"),
    ("cc-by-nc-4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"),
    ("CC BY-NC 3.0", "cc-by-nc-3.0", "cc-by-nc", "cc"),
    ("CC BY-NC-SA 4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"),
    ("cc by-nc-sa 4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"),
    ("cc-by-nc-sa-4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"),
    ("CC BY-NC-SA 3.0", "cc-by-nc-sa-3.0", "cc-by-nc-sa", "cc"),
    ("CC BY-NC-ND 4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
    ("cc by-nc-nd 4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
    ("cc-by-nc-nd-4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
    ("CC BY-NC-ND 3.0", "cc-by-nc-nd-3.0", "cc-by-nc-nd", "cc"),
    ("cc by-nc-nd 3.0", "cc-by-nc-nd-3.0", "cc-by-nc-nd", "cc"),
    ("CC BY-ND 4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"),
    ("cc by-nd 4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"),
    ("cc-by-nd-4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"),
    ("CC BY-SA 4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"),
    ("cc by-sa 4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"),
    ("cc-by-sa-4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"),
    ("CC BY-SA 3.0", "cc-by-sa-3.0", "cc-by-sa", "cc"),
    ("cc-by-3.0-igo", "cc-by-3.0-igo", "cc-by", "cc"),
    ("cc-by-nc-nd-3.0-igo", "cc-by-nc-nd-3.0-igo", "cc-by-nc-nd", "cc"),
    # CC0
    ("CC0 1.0", "cc0-1.0", "cc0", "cc0"),
    ("cc0 1.0", "cc0-1.0", "cc0", "cc0"),
    ("cc0-1.0", "cc0-1.0", "cc0", "cc0"),
    ("CC0", "cc0-1.0", "cc0", "cc0"),
    ("cc0", "cc0-1.0", "cc0", "cc0"),
    ("cc-zero", "cc0-1.0", "cc0", "cc0"),
    ("CC Zero", "cc0-1.0", "cc0", "cc0"),
    ("CC-Zero", "cc0-1.0", "cc0", "cc0"),
    ("creative commons zero", "cc0-1.0", "cc0", "cc0"),
    ("Creative Commons Zero 1.0", "cc0-1.0", "cc0", "cc0"),
    # CC-PDM
    ("cc-pdm", "cc-pdm-1.0", "cc-pdm", "public-domain"),
    ("CC-PDM", "cc-pdm-1.0", "cc-pdm", "public-domain"),
    ("cc-pdm-1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"),
    ("CC-PDM 1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"),
    ("cc-pdm 1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"),
    ("creative commons public domain", "cc-pdm-1.0", "cc-pdm", "public-domain"),
    # CC shorthand
    ("creative commons by", "cc-by", "cc-by", "cc"),
    ("creative commons by 4.0", "cc-by-4.0", "cc-by", "cc"),
    (
        "creative commons by-sa",
        "cc-by-sa",
        "cc-by-sa",
        "cc",
    ),  # Specifies by-sa, license must be cc-by-sa
    (
        "creative commons by-nc",
        "cc-by-nc",
        "cc-by-nc",
        "cc",
    ),  # Specifies by-nc, license must be cc-by-nc
    (
        "creative commons by-nc-sa",
        "cc-by-nc-sa",
        "cc-by-nc-sa",
        "cc",
    ),  # Specifies by-nc-sa, license must be cc-by-nc-sa
    (
        "creative commons by-nc-nd",
        "cc-by-nc-nd",
        "cc-by-nc-nd",
        "cc",
    ),  # Specifies by-nc-nd, license must be cc-by-nc-nd
    (
        "creative commons by-nd",
        "cc-by-nd",
        "cc-by-nd",
        "cc",
    ),  # Specifies by-nd, license must be cc-by-nd
    # CC URLs
    (
        "http://creativecommons.org/licenses/by-nc-nd/4.0/",
        "cc-by-nc-nd-4.0",
        "cc-by-nc-nd",
        "cc",
    ),
    ("https://creativecommons.org/licenses/by/4.0/", "cc-by-4.0", "cc-by", "cc"),
    ("http://creativecommons.org/licenses/by/4.0/", "cc-by-4.0", "cc-by", "cc"),
    (
        "https://creativecommons.org/licenses/by-nc/4.0/",
        "cc-by-nc-4.0",
        "cc-by-nc",
        "cc",
    ),
    (
        "https://creativecommons.org/licenses/by-nc-sa/4.0/",
        "cc-by-nc-sa-4.0",
        "cc-by-nc-sa",
        "cc",
    ),
    (
        "https://creativecommons.org/licenses/by-nd/4.0/",
        "cc-by-nd-4.0",
        "cc-by-nd",
        "cc",
    ),
    (
        "https://creativecommons.org/licenses/by-sa/4.0/",
        "cc-by-sa-4.0",
        "cc-by-sa",
        "cc",
    ),
    (
        "http://creativecommons.org/licenses/by-nc-nd/3.0/igo/",
        "cc-by-nc-nd-3.0-igo",
        "cc-by-nc-nd",
        "cc",
    ),
    (
        "https://creativecommons.org/licenses/by/3.0/igo/",
        "cc-by-3.0-igo",
        "cc-by",
        "cc",
    ),
    ("https://creativecommons.org/publicdomain/zero/1.0/", "cc0-1.0", "cc0", "cc0"),
    ("http://creativecommons.org/publicdomain/zero/1.0/", "cc0-1.0", "cc0", "cc0"),
    # CC prose
    ("licensed under cc by-nc-nd 4.0 terms", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
    (
        "content is licensed under creative commons by-nc-sa",
        "cc-by-nc-sa",
        "cc-by-nc-sa",  # Contains by-nc-sa, license must be cc-by-nc-sa
        "cc",
    ),
    ("this content is under creative commons by license", "cc-by", "cc-by", "cc"),
    # Open Data
    ("ODbL", "odbl", "odbl", "open-data"),
    ("odbl", "odbl", "odbl", "open-data"),
    ("Open Database License", "odbl", "odbl", "open-data"),
    ("ODC-BY", "odc-by", "odc-by", "open-data"),
    ("odc-by", "odc-by", "odc-by", "open-data"),
    ("PDDL", "pddl", "pddl", "open-data"),
    ("pddl", "pddl", "pddl", "open-data"),
    (
        "Open Data Commons Public Domain Dedication",
        "public-domain",
        "public-domain",
        "public-domain",
    ),
    # Publisher
    ("elsevier-oa", "elsevier-oa", "elsevier-oa", "publisher-oa"),
    (
        "Elsevier OA",
        "elsevier-oa",
        "elsevier-oa",
        "publisher-oa",
    ),  # "Elsevier OA" unambiguously identifies Elsevier OA license
    ("elsevier tdm", "elsevier-tdm", "elsevier-tdm", "publisher-tdm"),
    ("Elsevier TDM", "elsevier-tdm", "elsevier-tdm", "publisher-tdm"),
    ("Elsevier User License", "elsevier-oa", "elsevier-oa", "publisher-oa"),
    (
        "https://www.elsevier.com/open-access/userlicense/1.0/",
        "elsevier-oa",
        "elsevier-oa",
        "publisher-oa",
    ),
    ("wiley-tdm", "wiley-tdm", "wiley-tdm", "publisher-tdm"),
    ("Wiley TDM", "wiley-tdm", "wiley-tdm", "publisher-tdm"),
    ("wiley vor", "wiley-vor", "wiley-vor", "publisher-proprietary"),
    ("springer-tdm", "springer-tdm", "springer-tdm", "publisher-tdm"),
    (
        "Springer Nature TDM",
        "springernature-tdm",
        "springernature-tdm",
        "publisher-tdm",
    ),
    ("acs-authorchoice", "acs-authorchoice", "acs-authorchoice", "publisher-oa"),
    ("ACS AuthorChoice", "acs-authorchoice", "acs-authorchoice", "publisher-oa"),
    (
        "acs-authorchoice-ccby",
        "acs-authorchoice-ccby",
        "acs-authorchoice-ccby",
        "publisher-oa",
    ),
    (
        "acs authorchoice cc by",
        "acs-authorchoice-ccby",
        "acs-authorchoice-ccby",
        "publisher-oa",
    ),
    ("aps-default", "aps-default", "aps-default", "publisher-proprietary"),
    ("APS Default", "aps-default", "aps-default", "publisher-proprietary"),
    ("iop-tdm", "iop-tdm", "iop-tdm", "publisher-tdm"),
    ("iop copyright", "iop-copyright", "iop-copyright", "publisher-proprietary"),
    ("bmj copyright", "bmj-copyright", "bmj-copyright", "publisher-proprietary"),
    ("rsc terms", "rsc-terms", "rsc-terms", "publisher-proprietary"),
    ("cup terms", "cup-terms", "cup-terms", "publisher-proprietary"),
    ("degruyter terms", "degruyter-terms", "degruyter-terms", "publisher-proprietary"),
    ("tandf terms", "tandf-terms", "tandf-terms", "publisher-proprietary"),
    (
        "sage permissions",
        "sage-permissions",
        "sage-permissions",
        "publisher-proprietary",
    ),
    ("wiley terms", "wiley-terms", "wiley-terms", "publisher-proprietary"),
    ("wiley am", "wiley-am", "wiley-am", "publisher-proprietary"),
    ("pnas licenses", "pnas-licenses", "pnas-licenses", "publisher-proprietary"),
    (
        "aaas author reuse",
        "aaas-author-reuse",
        "aaas-author-reuse",
        "publisher-proprietary",
    ),
    ("aip rights", "aip-rights", "aip-rights", "publisher-proprietary"),
    ("jama cc by", "jama-cc-by", "jama-cc-by", "publisher-oa"),
    ("thieme nlm", "thieme-nlm", "thieme-nlm", "publisher-oa"),
    ("oup chorus", "oup-chorus", "oup-chorus", "publisher-oa"),
    ("implied oa", "implied-oa", "implied-oa", "publisher-oa"),
    ("implied open access", "implied-oa", "implied-oa", "publisher-oa"),
    ("unspecified oa", "unspecified-oa", "unspecified-oa", "other-oa"),
    (
        "publisher specific oa",
        "publisher-specific-oa",
        "publisher-specific-oa",
        "publisher-oa",
    ),
    ("author manuscript", "author-manuscript", "author-manuscript", "publisher-oa"),
    ("open access", "other-oa", "other-oa", "other-oa"),
    ("other-oa", "other-oa", "other-oa", "other-oa"),
    (
        "all rights reserved",
        "all-rights-reserved",
        "all-rights-reserved",
        "publisher-proprietary",
    ),
    ("no reuse", "no-reuse", "no-reuse", "publisher-proprietary"),
    # Publisher prose
    (
        "this article is licensed under elsevier tdm agreement",
        "elsevier-tdm",
        "elsevier-tdm",
        "publisher-tdm",
    ),
    (
        "journal article under elsevier user license for open access",
        "elsevier-oa",
        "elsevier-oa",
        "publisher-oa",
    ),
    (
        "acs authorchoice option was selected by the authors",
        "acs-authorchoice",
        "acs-authorchoice",
        "publisher-oa",
    ),
    (
        "springer tdm policy applies to this content",
        "springer-tdm",
        "springer-tdm",
        "publisher-tdm",
    ),
    # Unknown
    (
        "Totally Fake License XYZ999",
        "totally fake license xyz999",
        "totally fake license xyz999",
        "unknown",
    ),
    # Public domain
    ("public domain", "public-domain", "public-domain", "public-domain"),
    ("public-domain", "public-domain", "public-domain", "public-domain"),
    ("pd", "public-domain", "public-domain", "public-domain"),
]


@pytest.mark.parametrize(
    "raw,expected_key,expected_license,expected_family", LICENSE_MATRIX
)
def test_license_matrix(raw, expected_key, expected_license, expected_family):
    """Check one LICENSE_MATRIX row: (raw input, version key, license key, family key)."""
    v = normalise_license(raw)
    assert v.key == expected_key, f"input: {raw!r}  key: {v.key!r} != {expected_key!r}"
    assert v.license.key == expected_license, (
        f"input: {raw!r}  license: {v.license.key!r} != {expected_license!r}"
    )
    assert v.family.key == expected_family, (
        f"input: {raw!r}  family: {v.family.key!r} != {expected_family!r}"
    )


def test_strict_mode_unknown_raises():
    with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
        normalise_license("xyzzy unknown license 123", strict=True)


def test_strict_mode_known_does_not_raise():
    v = normalise_license("mit", strict=True)
    assert v.key == "mit"


def test_empty_string_returns_unknown():
    v = normalise_license("")
    assert v.key == "unknown"
    assert v.family.key == "unknown"


def test_whitespace_only_returns_unknown():
    v = normalise_license("   \n\t  ")
    assert v.key == "unknown"


def test_batch_normalise_preserves_order():
    inputs = ["MIT", "Apache-2.0", "CC BY 4.0", "unknown garbage"]
    results = normalise_licenses(inputs)
    assert [r.key for r in results] == [
        "mit",
        "apache-2.0",
        "cc-by-4.0",
        "unknown garbage",
    ]


def test_normalise_mit():
    v = normalise_license("MIT")
    assert isinstance(v, LicenseVersion)
    assert v.key == "mit"
    assert str(v) == "mit"
    assert str(v.license) == "mit"


def test_normalise_cc():
    v = normalise_license("CC BY 4.0")
    assert v.key == "cc-by-4.0"
    assert str(v.license) == "cc-by"
    assert str(v.family) == "cc"


def test_batch():
    results = normalise_licenses(["MIT", "Apache-2.0"])
    assert len(results) == 2
    assert results[0].key == "mit"
    assert results[1].key == "apache-2.0"


def test_strict_mode_raises():
    with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
        normalise_license("Totally Fake License XYZ999", strict=True)


def test_strict_batch_raises():
    with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
        normalise_licenses(["MIT", "Fake License XYZ999"], strict=True)


def test_empty_input():
    v = normalise_license("")
    assert v.key == "unknown"
    v = normalise_license("   ")
    assert v.key == "unknown"


def test_real_world_license_strings():
    """Test against real-world license strings collected from the wild."""
    cases = [
        ("http://creativecommons.org/licenses/by-nc-nd/4.0/", "cc-by-nc-nd-4.0"),
        ("http://creativecommons.org/licenses/by/4.0/", "cc-by-4.0"),
        ("http://creativecommons.org/licenses/by-nc/4.0/", "cc-by-nc-4.0"),
        (
            "http://www.elsevier.com/open-access/userlicense/1.0/",
            "elsevier-oa",
        ),
        (
            "http://creativecommons.org/licenses/by-nc-nd/3.0/igo/",
            "cc-by-nc-nd-3.0-igo",
        ),
        ("CC BY-NC-ND 4.0", "cc-by-nc-nd-4.0"),
        (
            "http://creativecommons.org/licenses/by/3.0/igo/",
            "cc-by-3.0-igo",
        ),
    ]
    for raw, expected_key in cases:
        v = normalise_license(raw)
        assert v.key == expected_key, (
            f"input: {raw!r} -> got {v.key!r}, want {expected_key!r}"
        )

src/licence_normaliser/tests/test_models.py

"""Unit tests for _models.py."""

import pytest

from licence_normaliser._models import LicenseFamily, LicenseName, LicenseVersion

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


def _cc_fam():
    return LicenseFamily(key="cc")


def _osi_fam():
    return LicenseFamily(key="osi")


def _cc_by_name():
    return LicenseName(key="cc-by", family=_cc_fam())


def _mit_version():
    return LicenseVersion(
        key="mit",
        url="https://opensource.org/licenses/MIT",
        license=LicenseName(key="mit", family=_osi_fam()),
    )


class TestLicenseFamily:
    def test_str(self):
        assert str(LicenseFamily(key="cc")) == "cc"

    def test_repr(self):
        assert repr(LicenseFamily(key="osi")) == "LicenseFamily('osi')"

    def test_eq_same_type(self):
        assert LicenseFamily(key="cc") == LicenseFamily(key="cc")

    def test_eq_str(self):
        assert LicenseFamily(key="cc") == "cc"

    def test_neq(self):
        assert LicenseFamily(key="cc") != LicenseFamily(key="osi")

    def test_hash_usable_in_set(self):
        s = {LicenseFamily(key="cc"), LicenseFamily(key="cc"), LicenseFamily(key="osi")}
        assert len(s) == 2

    def test_frozen_prevents_mutation(self):
        fam = LicenseFamily(key="cc")
        with pytest.raises((AttributeError, TypeError)):
            fam.key = "other"  # type: ignore


class TestLicenseName:
    def test_str(self):
        assert str(_cc_by_name()) == "cc-by"

    def test_frozen_prevents_mutation(self):
        name = _cc_by_name()
        with pytest.raises((AttributeError, TypeError)):
            name.key = "other"  # type: ignore

    def test_family_reference(self):
        assert _cc_by_name().family.key == "cc"


class TestLicenseVersion:
    def test_str(self):
        assert str(_mit_version()) == "mit"

    def test_family_shortcut(self):
        assert _mit_version().family.key == "osi"

    def test_frozen_prevents_mutation(self):
        v = _mit_version()
        with pytest.raises((AttributeError, TypeError)):
            v.key = "other"  # type: ignore

    def test_url_stored(self):
        assert _mit_version().url == "https://opensource.org/licenses/MIT"

    def test_url_none(self):
        v = LicenseVersion(
            key="unknown",
            url=None,
            license=LicenseName(key="unknown", family=LicenseFamily(key="unknown")),
        )
        assert v.url is None

src/licence_normaliser/tests/test_prose.py

"""Tests for prose pattern matching via ProseParser."""

from licence_normaliser import normalise_license


class TestProsePatternMatching:
    def test_cc_by_nc_nd_4_0_prose(self):
        v = normalise_license("this work is licensed under cc by-nc-nd 4.0 terms")
        assert v.key == "cc-by-nc-nd-4.0"
        assert v.family.key == "cc"

    def test_cc_by_nc_nd_3_0_prose(self):
        v = normalise_license("license: cc by-nc-nd 3.0")
        assert v.key == "cc-by-nc-nd-3.0"
        assert v.family.key == "cc"

    def test_cc_by_nc_sa_creative_commons_prose(self):
        v = normalise_license("content licensed under creative commons by-nc-sa")
        assert v.key == "cc-by-nc-sa"
        assert v.family.key == "cc"

    def test_attribution_prose(self):
        v = normalise_license(
            "this content is made available under creative commons by license"
        )
        assert v.key == "cc-by"
        assert v.family.key == "cc"

    def test_attribution_noncommercial_prose(self):
        v = normalise_license(
            "this article is licensed under attribution noncommercial terms"
        )
        assert v.key == "cc-by-nc"
        assert v.family.key == "cc"

    def test_attribution_sharealike_prose(self):
        v = normalise_license("licensed under attribution share alike conditions")
        assert v.key == "cc-by-sa"
        assert v.family.key == "cc"

    def test_elsevier_tdm_prose(self):
        v = normalise_license(
            "this journal participates in text and data mining as "
            "permitted by the elsevier tdm agreement"
        )
        assert v.key == "elsevier-tdm"
        assert v.family.key == "publisher-tdm"

    def test_elsevier_user_license_prose(self):
        v = normalise_license(
            "elsevier user license applies to this open access article"
        )
        assert v.key == "elsevier-oa"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_prose(self):
        v = normalise_license("acs authorchoice option was selected by the authors")
        assert v.key == "acs-authorchoice"
        assert v.family.key == "publisher-oa"

    def test_all_rights_reserved_prose(self):
        v = normalise_license("all rights reserved except as permitted by law")
        assert v.key == "all-rights-reserved"
        assert v.family.key == "publisher-proprietary"

    def test_short_string_via_registry(self):
        v = normalise_license("cc by-nc-nd")
        assert v.key == "cc-by-nc-nd"
        assert v.family.key == "cc"

    def test_open_access_prose_matched(self):
        v = normalise_license("open access article available now")
        assert v.key == "other-oa"
        assert v.family.key == "other-oa"

src/licence_normaliser/tests/test_publisher.py

"""Tests for PublisherParser - publisher URLs and shorthand aliases."""

from licence_normaliser import normalise_license


class TestPublisherUrls:
    def test_elsevier_oa_url(self):
        v = normalise_license("https://www.elsevier.com/open-access/userlicense/1.0/")
        assert v.key == "elsevier-oa"
        assert v.family.key == "publisher-oa"

    def test_elsevier_oa_url_http(self):
        v = normalise_license("http://www.elsevier.com/open-access/userlicense/1.0/")
        assert v.key == "elsevier-oa"
        assert v.family.key == "publisher-oa"

    def test_elsevier_tdm_url(self):
        v = normalise_license("https://www.elsevier.com/tdm/userlicense/1.0/")
        assert v.key == "elsevier-tdm"
        assert v.family.key == "publisher-tdm"

    def test_wiley_tdm_url(self):
        v = normalise_license("http://doi.wiley.com/10.1002/tdm_license_1")
        assert v.key == "wiley-tdm"
        assert v.family.key == "publisher-tdm"

    def test_wiley_terms_url(self):
        v = normalise_license("https://onlinelibrary.wiley.com/terms-and-conditions")
        assert v.key == "wiley-terms"
        assert v.family.key == "publisher-proprietary"

    def test_springer_tdm_url(self):
        v = normalise_license("https://www.springer.com/tdm")
        assert v.key == "springer-tdm"
        assert v.family.key == "publisher-tdm"

    def test_springernature_tdm_url(self):
        v = normalise_license(
            "https://www.springernature.com/gp/researchers/text-and-data-mining"
        )
        assert v.key == "springernature-tdm"
        assert v.family.key == "publisher-tdm"

    def test_acs_authorchoice_ccby_url(self):
        v = normalise_license(
            "https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html"
        )
        assert v.key == "acs-authorchoice-ccby"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_url(self):
        v = normalise_license(
            "https://pubs.acs.org/page/policy/authorchoice_termsofuse.html"
        )
        assert v.key == "acs-authorchoice"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_nih_url(self):
        v = normalise_license(
            "https://pubs.acs.org/page/policy/"
            "acs_authorchoice_with_nih_addendum_termsofuse.html"
        )
        assert v.key == "acs-authorchoice-nih"
        assert v.family.key == "publisher-oa"

    def test_rsc_terms_url(self):
        v = normalise_license(
            "https://www.rsc.org/journals-books-databases/"
            "journal-authors-reviewers/licences-copyright-permissions/"
        )
        assert v.key == "rsc-terms"
        assert v.family.key == "publisher-proprietary"

    def test_iop_tdm_url(self):
        v = normalise_license(
            "https://iopscience.iop.org/info/page/text-and-data-mining"
        )
        assert v.key == "iop-tdm"
        assert v.family.key == "publisher-tdm"

    def test_bmj_copyright_url(self):
        v = normalise_license(
            "https://www.bmj.com/company/legal-stuff/copyright-notice/"
        )
        assert v.key == "bmj-copyright"
        assert v.family.key == "publisher-proprietary"

    def test_aaas_author_reuse_url(self):
        v = normalise_license(
            "https://www.science.org/content/page/science-licenses-journal-article-reuse"
        )
        assert v.key == "aaas-author-reuse"
        assert v.family.key == "publisher-proprietary"

    def test_aps_default_url(self):
        v = normalise_license("https://link.aps.org/licenses/aps-default-license")
        assert v.key == "aps-default"
        assert v.family.key == "publisher-proprietary"

    def test_aps_tdm_url(self):
        v = normalise_license(
            "https://link.aps.org/licenses/aps-default-text-mining-license"
        )
        assert v.key == "aps-tdm"
        assert v.family.key == "publisher-tdm"

    def test_cup_terms_url(self):
        v = normalise_license("https://www.cambridge.org/core/terms")
        assert v.key == "cup-terms"
        assert v.family.key == "publisher-proprietary"

    def test_aip_rights_url(self):
        v = normalise_license(
            "https://publishing.aip.org/authors/rights-and-permissions"
        )
        assert v.key == "aip-rights"
        assert v.family.key == "publisher-proprietary"

    def test_jama_cc_by_url(self):
        v = normalise_license("https://jamanetwork.com/pages/cc-by-license-permissions")
        assert v.key == "jama-cc-by"
        assert v.family.key == "publisher-oa"

    def test_oup_chorus_url(self):
        v = normalise_license(
            "https://academic.oup.com/journals/pages/open_access/"
            "funder_policies/chorus/standard_publication_model"
        )
        assert v.key == "oup-chorus"
        assert v.family.key == "publisher-oa"

    def test_oup_terms_url(self):
        v = normalise_license(
            "https://academic.oup.com/pages/standard-publication-reuse-rights"
        )
        assert v.key == "oup-terms"
        assert v.family.key == "publisher-proprietary"

    def test_sage_permissions_url(self):
        v = normalise_license("https://us.sagepub.com/en-us/nam/journals-permissions")
        assert v.key == "sage-permissions"
        assert v.family.key == "publisher-proprietary"

    def test_tandf_terms_url(self):
        v = normalise_license("https://www.tandfonline.com/action/showCopyRight")
        assert v.key == "tandf-terms"
        assert v.family.key == "publisher-proprietary"

    def test_gnu_gpl_url(self):
        v = normalise_license("https://www.gnu.org/licenses/gpl-3.0.html")
        assert v.key == "gpl-3.0"
        assert v.family.key == "copyleft"


class TestPublisherShorthand:
    def test_elsevier_user_license(self):
        v = normalise_license("elsevier user license")
        assert v.key == "elsevier-oa"
        assert v.family.key == "publisher-oa"

    def test_elsevier_tdm_shorthand(self):
        v = normalise_license("elsevier tdm")
        assert v.key == "elsevier-tdm"
        assert v.family.key == "publisher-tdm"

    def test_wiley_tdm_shorthand(self):
        v = normalise_license("wiley tdm license")
        assert v.key == "wiley-tdm"
        assert v.family.key == "publisher-tdm"

    def test_wiley_vor(self):
        v = normalise_license("wiley vor")
        assert v.key == "wiley-vor"
        assert v.family.key == "publisher-proprietary"

    def test_wiley_am(self):
        v = normalise_license("wiley am")
        assert v.key == "wiley-am"
        assert v.family.key == "publisher-proprietary"

    def test_springer_tdm_shorthand(self):
        v = normalise_license("springer tdm")
        assert v.key == "springer-tdm"
        assert v.family.key == "publisher-tdm"

    def test_springer_nature_tdm_shorthand(self):
        v = normalise_license("springer nature tdm")
        assert v.key == "springernature-tdm"
        assert v.family.key == "publisher-tdm"

    def test_acs_authorchoice_shorthand(self):
        v = normalise_license("acs authorchoice")
        assert v.key == "acs-authorchoice"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_ccby_shorthand(self):
        v = normalise_license("acs authorchoice cc by")
        assert v.key == "acs-authorchoice-ccby"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_nih_shorthand(self):
        v = normalise_license("acs authorchoice nih")
        assert v.key == "acs-authorchoice-nih"
        assert v.family.key == "publisher-oa"

    def test_rsc_terms_shorthand(self):
        v = normalise_license("rsc terms")
        assert v.key == "rsc-terms"
        assert v.family.key == "publisher-proprietary"

    def test_iop_tdm_shorthand(self):
        v = normalise_license("iop tdm")
        assert v.key == "iop-tdm"
        assert v.family.key == "publisher-tdm"

    def test_iop_copyright_shorthand(self):
        v = normalise_license("iop copyright")
        assert v.key == "iop-copyright"
        assert v.family.key == "publisher-proprietary"

    def test_bmj_copyright_shorthand(self):
        v = normalise_license("bmj copyright")
        assert v.key == "bmj-copyright"
        assert v.family.key == "publisher-proprietary"

    def test_aaas_author_reuse_shorthand(self):
        v = normalise_license("aaas author reuse")
        assert v.key == "aaas-author-reuse"
        assert v.family.key == "publisher-proprietary"

    def test_pnas_licenses_shorthand(self):
        v = normalise_license("pnas licenses")
        assert v.key == "pnas-licenses"
        assert v.family.key == "publisher-proprietary"

    def test_aps_default_shorthand(self):
        v = normalise_license("aps default")
        assert v.key == "aps-default"
        assert v.family.key == "publisher-proprietary"

    def test_aps_tdm_shorthand(self):
        v = normalise_license("aps tdm")
        assert v.key == "aps-tdm"
        assert v.family.key == "publisher-tdm"

    def test_cup_terms_shorthand(self):
        v = normalise_license("cup terms")
        assert v.key == "cup-terms"
        assert v.family.key == "publisher-proprietary"

    def test_aip_rights_shorthand(self):
        v = normalise_license("aip rights")
        assert v.key == "aip-rights"
        assert v.family.key == "publisher-proprietary"

    def test_jama_cc_by_shorthand(self):
        v = normalise_license("jama cc by")
        assert v.key == "jama-cc-by"
        assert v.family.key == "publisher-oa"

    def test_degruyter_terms_shorthand(self):
        v = normalise_license("degruyter terms")
        assert v.key == "degruyter-terms"
        assert v.family.key == "publisher-proprietary"

    def test_oup_chorus_shorthand(self):
        v = normalise_license("oup chorus")
        assert v.key == "oup-chorus"
        assert v.family.key == "publisher-oa"

    def test_oup_terms_shorthand(self):
        v = normalise_license("oup terms")
        assert v.key == "oup-terms"
        assert v.family.key == "publisher-proprietary"

    def test_sage_permissions_shorthand(self):
        v = normalise_license("sage permissions")
        assert v.key == "sage-permissions"
        assert v.family.key == "publisher-proprietary"

    def test_tandf_terms_shorthand(self):
        v = normalise_license("tandf terms")
        assert v.key == "tandf-terms"
        assert v.family.key == "publisher-proprietary"

    def test_thieme_nlm_shorthand(self):
        v = normalise_license("thieme nlm")
        assert v.key == "thieme-nlm"
        assert v.family.key == "publisher-oa"


class TestPublisherDirectKeys:
    def test_elsevier_tdm_key(self):
        v = normalise_license("elsevier-tdm")
        assert v.key == "elsevier-tdm"
        assert v.family.key == "publisher-tdm"

    def test_elsevier_oa_key(self):
        v = normalise_license("elsevier-oa")
        assert v.key == "elsevier-oa"
        assert v.family.key == "publisher-oa"

    def test_wiley_tdm_key(self):
        v = normalise_license("wiley-tdm")
        assert v.key == "wiley-tdm"
        assert v.family.key == "publisher-tdm"

    def test_acs_authorchoice_key(self):
        v = normalise_license("acs-authorchoice")
        assert v.key == "acs-authorchoice"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_ccby_key(self):
        v = normalise_license("acs-authorchoice-ccby")
        assert v.key == "acs-authorchoice-ccby"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_nih_key(self):
        v = normalise_license("acs-authorchoice-nih")
        assert v.key == "acs-authorchoice-nih"
        assert v.family.key == "publisher-oa"

    def test_iop_tdm_key(self):
        v = normalise_license("iop-tdm")
        assert v.key == "iop-tdm"
        assert v.family.key == "publisher-tdm"

    def test_aps_tdm_key(self):
        v = normalise_license("aps-tdm")
        assert v.key == "aps-tdm"
        assert v.family.key == "publisher-tdm"

    def test_oup_chorus_key(self):
        v = normalise_license("oup-chorus")
        assert v.key == "oup-chorus"
        assert v.family.key == "publisher-oa"

    def test_jama_cc_by_key(self):
        v = normalise_license("jama-cc-by")
        assert v.key == "jama-cc-by"
        assert v.family.key == "publisher-oa"

    def test_thieme_nlm_key(self):
        v = normalise_license("thieme-nlm")
        assert v.key == "thieme-nlm"
        assert v.family.key == "publisher-oa"

    def test_implied_oa_key(self):
        v = normalise_license("implied-oa")
        assert v.key == "implied-oa"
        assert v.family.key == "publisher-oa"

    def test_unspecified_oa_key(self):
        v = normalise_license("unspecified-oa")
        assert v.key == "unspecified-oa"
        assert v.family.key == "other-oa"

    def test_author_manuscript_key(self):
        v = normalise_license("author-manuscript")
        assert v.key == "author-manuscript"
        assert v.family.key == "publisher-oa"

    def test_all_rights_reserved_key(self):
        v = normalise_license("all-rights-reserved")
        assert v.key == "all-rights-reserved"
        assert v.family.key == "publisher-proprietary"

    def test_no_reuse_key(self):
        v = normalise_license("no-reuse")
        assert v.key == "no-reuse"
        assert v.family.key == "publisher-proprietary"

    def test_other_oa_key(self):
        v = normalise_license("other-oa")
        assert v.key == "other-oa"
        assert v.family.key == "other-oa"

    def test_public_domain_key(self):
        v = normalise_license("public-domain")
        assert v.key == "public-domain"
        assert v.family.key == "public-domain"

    def test_open_access_key(self):
        v = normalise_license("open-access")
        assert v.key == "other-oa"
        assert v.family.key == "other-oa"


class TestPublisherCatchAll:
    def test_implied_oa_shorthand(self):
        v = normalise_license("implied oa")
        assert v.key == "implied-oa"
        assert v.family.key == "publisher-oa"

    def test_unspecified_oa_shorthand(self):
        v = normalise_license("unspecified oa")
        assert v.key == "unspecified-oa"
        assert v.family.key == "other-oa"

    def test_open_access_shorthand(self):
        v = normalise_license("open access")
        assert v.key == "other-oa"
        assert v.family.key == "other-oa"

    def test_author_manuscript_shorthand(self):
        v = normalise_license("author manuscript")
        assert v.key == "author-manuscript"
        assert v.family.key == "publisher-oa"

    def test_all_rights_reserved_shorthand(self):
        v = normalise_license("all rights reserved")
        assert v.key == "all-rights-reserved"
        assert v.family.key == "publisher-proprietary"

    def test_no_reuse_shorthand(self):
        v = normalise_license("no reuse")
        assert v.key == "no-reuse"
        assert v.family.key == "publisher-proprietary"


class TestCCPublicDomain:
    def test_cc_pdm_bare_key(self):
        v = normalise_license("cc-pdm")
        assert v.key == "cc-pdm-1.0"
        assert v.family.key == "public-domain"

    def test_cc_pdm_versioned_key(self):
        v = normalise_license("cc-pdm-1.0")
        assert v.key == "cc-pdm-1.0"
        assert v.family.key == "public-domain"

    def test_cc0_bare_key(self):
        v = normalise_license("cc0")
        assert v.key == "cc0-1.0"
        assert v.family.key == "cc0"

    def test_cc0_versioned_key(self):
        v = normalise_license("cc0-1.0")
        assert v.key == "cc0-1.0"
        assert v.family.key == "cc0"

    def test_cc_zero_shorthand(self):
        v = normalise_license("cc-zero")
        assert v.key == "cc0-1.0"
        assert v.family.key == "cc0"

    def test_public_domain_fallback(self):
        v = normalise_license("public-domain")
        assert v.key == "public-domain"
        assert v.family.key == "public-domain"

    def test_creative_commons_zero(self):
        v = normalise_license("creative commons zero")
        assert v.key == "cc0-1.0"
        assert v.family.key == "cc0"

    def test_creative_commons_public_domain(self):
        v = normalise_license("creative commons public domain")
        assert v.key == "cc-pdm-1.0"
        assert v.family.key == "public-domain"
