Project source-tree ******************* Below is the layout of the project (to 10 levels), followed by the contents of each key file. Project directory layout licence-normaliser/ ├── scripts │ ├── __init__.py │ ├── check_missing_aliases.py │ ├── compare_datasets.py │ ├── README.rst │ └── test_name_inference.py ├── src │ └── licence_normaliser │ ├── cli │ │ ├── __init__.py │ │ └── _main.py │ ├── data │ │ ├── aliases │ │ │ └── aliases.json │ │ ├── prose │ │ │ └── prose_patterns.json │ │ ├── publishers │ │ │ └── publishers.json │ │ ├── urls │ │ │ └── url_map.json │ │ └── README.rst │ ├── parsers │ │ ├── __init__.py │ │ ├── alias.py │ │ ├── creativecommons.py │ │ ├── opendefinition.py │ │ ├── osi.py │ │ ├── prose.py │ │ ├── publisher.py │ │ ├── scancode_licensedb.py │ │ └── spdx.py │ ├── tests │ │ ├── __init__.py │ │ ├── conftest.py │ │ ├── test_aliases.py │ │ ├── test_cache.py │ │ ├── test_cli.py │ │ ├── test_core.py │ │ ├── test_exceptions.py │ │ ├── test_integration.py │ │ ├── test_models.py │ │ ├── test_prose.py │ │ └── test_publisher.py │ ├── __init__.py │ ├── _cache.py │ ├── _core.py │ ├── _models.py │ ├── _normaliser.py │ ├── _trace.py │ ├── defaults.py │ ├── exceptions.py │ ├── plugins.py │ └── py.typed ├── AGENTS.md ├── conftest.py ├── CONTRIBUTING.rst ├── docker-compose.yml ├── Dockerfile ├── Makefile ├── pyproject.toml ├── README.rst └── tox.ini README.rst ========== README.rst ================== licence-normaliser ================== .. image:: https://raw.githubusercontent.com/barseghyanartur/licence-normaliser/main/docs/_static/licence_normaliser_logo.webp :alt: licence-normaliser logo :align: center Comprehensive license normalisation with a three-level hierarchy. .. image:: https://img.shields.io/pypi/v/licence-normaliser.svg :target: https://pypi.python.org/pypi/licence-normaliser :alt: PyPI Version .. 
image:: https://img.shields.io/pypi/pyversions/licence-normaliser.svg :target: https://pypi.python.org/pypi/licence-normaliser/ :alt: Supported Python versions .. image:: https://github.com/barseghyanartur/licence-normaliser/actions/workflows/test.yml/badge.svg?branch=main :target: https://github.com/barseghyanartur/licence-normaliser/actions :alt: Build Status .. image:: https://readthedocs.org/projects/licence-normaliser/badge/?version=latest :target: http://licence-normaliser.readthedocs.io :alt: Documentation Status .. image:: https://img.shields.io/badge/docs-llms.txt-blue :target: https://licence-normaliser.readthedocs.io/en/latest/llms.txt :alt: llms.txt - documentation for LLMs .. image:: https://img.shields.io/badge/license-MIT-blue.svg :target: https://github.com/barseghyanartur/licence-normaliser/#License :alt: MIT .. image:: https://coveralls.io/repos/github/barseghyanartur/licence-normaliser/badge.svg?branch=main&service=github :target: https://coveralls.io/github/barseghyanartur/licence-normaliser?branch=main :alt: Coverage ``licence-normaliser`` is a comprehensive license normalisation library that maps any license representation (SPDX tokens, URLs, prose descriptions) to a canonical three-level hierarchy. Features ======== - **Three-level hierarchy** - LicenseFamily → LicenseName → LicenseVersion. - **Wide format support** - SPDX tokens, URLs, prose descriptions. - **Creative Commons support** - Full CC family with versions and IGO variants. - **Publisher-specific licenses** - Springer, Nature, Elsevier, Wiley, ACS, and more. - **File-driven data** - Add aliases, URLs, and patterns by editing JSON files. No Python code changes required for new synonyms. - **Pluggable parsers** - Drop in a new parser class to ingest any external license registry. Parsers implement plugin interfaces (``RegistryPlugin``, ``URLPlugin``, etc.). - **Strict mode** - Raise ``LicenseNotFoundError`` instead of silently returning ``"unknown"``. 
- **Caching** - LRU caching for performance. - **CLI** - Command-line interface with ``--strict`` and ``--explain`` support. Hierarchy ========= The library uses a three-level hierarchy: 1. **LicenseFamily** - broad bucket: ``"cc"``, ``"osi"``, ``"copyleft"``, ``"publisher-tdm"``, ... 2. **LicenseName** - version-free: ``"cc-by"``, ``"cc-by-nc-nd"``, ``"mit"``, ``"wiley-tdm"`` 3. **LicenseVersion** - fully resolved: ``"cc-by-3.0"``, ``"cc-by-nc-nd-4.0"`` Installation ============ With ``uv``: .. code-block:: sh uv pip install licence-normaliser Or with ``pip``: .. code-block:: sh pip install licence-normaliser Quick start =========== .. code-block:: python :name: test_quick_start from licence_normaliser import normalise_license v = normalise_license("CC BY-NC-ND 4.0") str(v) # "cc-by-nc-nd-4.0" ← LicenseVersion str(v.license) # "cc-by-nc-nd" ← LicenseName str(v.license.family) # "cc" ← LicenseFamily Strict mode =========== By default, unresolvable inputs return an ``"unknown"`` result. Pass ``strict=True`` to raise ``LicenseNotFoundError`` instead: .. code-block:: python :name: test_strict_mode from licence_normaliser import normalise_license from licence_normaliser.exceptions import LicenseNotFoundError # Silent fallback (default) v = normalise_license("some-unknown-string") v.family.key # "unknown" # Strict: raises on unresolvable input try: v = normalise_license("some-unknown-string", strict=True) except LicenseNotFoundError as exc: print(exc.raw) # original input print(exc.cleaned) # cleaned form that failed lookup Trace / Explain =============== Set ``ENABLE_LICENCE_NORMALISER_TRACE=1`` or pass ``trace=True`` to get resolution traces showing how the license was matched: .. 
code-block:: python :name: test_trace from licence_normaliser import normalise_license # Via function v = normalise_license("cc by-nc-nd 3.0 igo", trace=True) print(v.explain()) # Via class from licence_normaliser import LicenseNormaliser ln = LicenseNormaliser(trace=True) v = ln.normalise_license("MIT") print(v.explain()) Output shows the resolution pipeline (alias → registry → url → prose → fallback) and which source file + line matched: .. code-block:: text Input: 'cc by-nc-nd 3.0 igo' → 'cc by-nc-nd 3.0 igo' [✓] alias: 'cc by-nc-nd 3.0 igo' → 'cc-by-nc-nd-3.0-igo' (line 139 in aliases.json) Result: version_key: 'cc-by-nc-nd-3.0-igo' name_key: 'cc-by-nc-nd' family_key: 'cc' The trace can also be accessed via ``v._trace`` for programmatic use. Batch normalisation =================== .. code-block:: python :name: test_batch_normalisation from licence_normaliser import normalise_licenses results = normalise_licenses(["MIT", "Apache-2.0", "CC BY 4.0"]) for r in results: print(r.key) # Strict batch - raises on first unresolvable results = normalise_licenses(["MIT", "Apache-2.0"], strict=True) Custom plugins ============== The ``LicenseNormaliser`` class lets you inject custom plugin classes for specialised use cases: .. code-block:: python :name: test_custom_plugins from licence_normaliser import LicenseNormaliser from licence_normaliser.parsers.alias import AliasParser from licence_normaliser.parsers.spdx import SPDXParser # Use only SPDX + Alias plugins (no CC, no publisher URLs) ln = LicenseNormaliser( registry=[SPDXParser], alias=[AliasParser], family=[AliasParser], name=[AliasParser], cache=True, cache_maxsize=8192, ) # MIT resolves via SPDX parser assert str(ln.normalise_license("MIT")) == "mit" # CC BY resolves via Alias assert str(ln.normalise_license("CC BY-NC-ND 4.0")) == "cc-by-nc-nd-4.0" .. note:: Explicit plugin passing is optional — ``LicenseNormaliser()`` automatically loads defaults. 
Use the pattern above only if you need custom plugins or want to reduce the number of plugins loaded. For caching, ``LicenseNormaliser`` wraps the resolution method with ``lru_cache``. Disable it by passing ``cache=False`` for debugging: .. code-block:: python :name: test_caching from licence_normaliser import LicenseNormaliser ln = LicenseNormaliser(cache=False) result = ln.normalise_license("MIT") Update data (CLI) ================= .. code-block:: sh licence-normaliser update-data --force # Fetches fresh SPDX, OpenDefinition, OSI, CreativeCommons, and ScanCode JSONs Integration tests (public API only) =================================== All integration tests live in ``src/licence_normaliser/tests/test_integration.py`` and only import the public API. CLI usage ========= Normalise a single license: .. code-block:: sh licence-normaliser normalise "MIT" # Output: mit licence-normaliser normalise --full "CC BY 4.0" # Output: # Key: cc-by-4.0 # URL: https://creativecommons.org/licenses/by/4.0/ # License: cc-by # Family: cc licence-normaliser normalise --strict "totally-unknown" # Exits with code 1 and prints an error Batch normalise: .. code-block:: sh licence-normaliser batch MIT "Apache-2.0" "CC BY 4.0" licence-normaliser batch --strict MIT "Apache-2.0" Exceptions ========== .. code-block:: python :name: test_exceptions from licence_normaliser.exceptions import ( LicenseNormaliserError, # base class LicenseNotFoundError, # raised by strict mode ) Testing ======= All tests run inside Docker: .. code-block:: sh make test To test a specific Python version: .. code-block:: sh make test-env ENV=py312 License ======= MIT Author ====== Artur Barseghyan CONTRIBUTING.rst ================ CONTRIBUTING.rst ====================== Contributor guidelines ====================== .. _licence-normaliser: https://github.com/barseghyanartur/licence-normaliser/ .. _uv: https://docs.astral.sh/uv/ .. _tox: https://tox.wiki .. _ruff: https://beta.ruff.rs/docs/ .. _doc8: https://doc8.readthedocs.io/ .. 
_pre-commit: https://pre-commit.com/#installation .. _issues: https://github.com/barseghyanartur/licence-normaliser/issues .. _discussions: https://github.com/barseghyanartur/licence-normaliser/discussions .. _pull request: https://github.com/barseghyanartur/licence-normaliser/pulls .. _versions manifest: https://github.com/actions/python-versions/blob/main/versions-manifest.json Developer prerequisites ----------------------- pre-commit ~~~~~~~~~~ Refer to `pre-commit`_ for installation instructions. TL;DR: .. code-block:: sh curl -LsSf https://astral.sh/uv/install.sh | sh # Install uv uv tool install pre-commit # Install pre-commit pre-commit install # Install hooks Installing `pre-commit`_ ensures all contributions adhere to the project's code quality standards. Code standards -------------- `ruff`_ and `doc8`_ are triggered automatically by `pre-commit`_. To run checks manually: .. code-block:: sh make doc8 make ruff Import conventions ~~~~~~~~~~~~~~~~~~ **Import statements belong at module level.** Avoid placing imports inside functions or methods unless absolutely necessary: - **Acceptable exceptions:** - Breaking circular dependencies - Optional runtime dependencies (e.g., CLI-only imports) - Heavy imports that are rarely used - **Why this matters:** - Improves code readability - Makes dependencies explicit and discoverable - Enables static analysis tools to work correctly - Follows Python community best practices (PEP 8) When in doubt, place imports at the top of the file. Virtual environment ------------------- .. code-block:: sh make create-venv Installation ------------ .. code-block:: sh make install Testing ------- .. note:: Python 3.15 is being tested on GitHub CI, but not inside a local Docker image. Docker-based testing (recommended) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All tests run inside Docker for platform independence and consistency: .. 
code-block:: sh make test # full matrix (Python 3.10-3.14) make test-env ENV=py312 # single Python version make shell # interactive shell in test container make shell-env ENV=py312 # interactive shell for specific Python Local testing (alternative) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ For faster iteration during development, you can run tests locally with ``uv``: .. code-block:: sh make install # one-time setup uv run pytest # run all tests uv run pytest path/to/test_something.py # run specific test **Important**: If you encounter tooling errors with local testing, fall back to Docker-based testing which is the canonical environment. GitHub Actions ~~~~~~~~~~~~~~ In any case, GitHub Actions runs the full matrix automatically on every push. Tests run on Python 3.10–3.15 (all non-EOL versions). See the `versions manifest`_ for the full list of available Python versions. Adding new normalisation rules ------------------------------ For a new **alias** or **family override** for an *existing* license: 1. Add an entry to ``src/licence_normaliser/data/aliases/aliases.json``. 2. Optionally, add an ``aliases`` array to define additional lookup variants (e.g. hyphen vs space forms) that resolve to the same target: .. code-block:: json { "cc by-nc": { "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc", "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"] } } 3. Add a test in ``src/licence_normaliser/tests/test_aliases.py`` or ``test_alias_expansion.py``. 4. No Python changes needed. For a new **prose pattern** (regex matching free-text descriptions): 1. Add an entry to ``src/licence_normaliser/data/prose/prose_patterns.json``. 2. Add a test in ``src/licence_normaliser/tests/test_prose.py``. 3. No Python changes needed. For a new **URL mapping**: 1. Add an entry to ``src/licence_normaliser/data/urls/url_map.json`` or ``src/licence_normaliser/data/publishers/publishers.json``. 2. Add a test in ``src/licence_normaliser/tests/test_publisher.py``. 3. 
No Python changes needed. For a **brand-new license key** (SPDX, OpenDefinition, OSI, CC, or ScanCode): 1. The upstream data source must be updated first (``licence-normaliser update-data --force`` for SPDX/OpenDefinition, or edit the upstream source for OSI/CC/ScanCode). 2. The parser will pick it up automatically on the next import. 3. Add an alias in ``aliases.json`` if needed. 4. Add family override in ``aliases.json`` if needed. 5. Add tests. For a **new parser** (new upstream data source): 1. Create ``src/licence_normaliser/parsers/my_parser.py`` implementing ``BasePlugin``. 2. Register it in ``src/licence_normaliser/parsers/__init__.py``. 3. Set ``is_registry_entry = False`` if the parser only contributes aliases/URLs/patterns (not new license keys). 4. Add tests. Releases -------- **Build the package for releasing:** .. code-block:: sh make package-build ---- **Test the built package:** .. code-block:: sh make check-package-build ---- **Make a test release (test.pypi.org):** .. code-block:: sh make test-release ---- **Release (pypi.org):** .. code-block:: sh make release Adding tests ------------ - Every new normalisation rule must have a corresponding test. - Tests should cover both successful normalisation and edge cases. Pull requests ------------- Open a `pull request`_ to the ``dev`` branch only. Never directly to ``main``. .. note:: Create pull requests to the ``dev`` branch only! Examples of welcome contributions: - Fixing documentation typos or improving explanations. - Adding test cases for new edge cases. - Extending support for additional license formats. - Improving error messages. General checklist ~~~~~~~~~~~~~~~~~ - Does your change require documentation updates (``README.rst``, ``AGENTS.md``, ``ARCHITECTURE.rst``, ``CONTRIBUTING.rst``)? - Does your change require new tests? - Does your change add any external dependencies? If so, reconsider: ``licence-normaliser`` should have minimal dependencies. 
When fixing bugs ~~~~~~~~~~~~~~~~ - Add a regression test that reproduces the bug before your fix. When adding a new feature ~~~~~~~~~~~~~~~~~~~~~~~~~ - Update ``README.rst``, ``AGENTS.md``, and ``ARCHITECTURE.rst`` if applicable. - Add appropriate tests. Questions --------- Ask on GitHub `discussions`_. Issues ------ Report bugs or request features on GitHub `issues`_. AGENTS.md ========= AGENTS.md # AGENTS.md - licence-normaliser **Repository**: https://github.com/barseghyanartur/licence-normaliser **Maintainer**: Artur Barseghyan --- ## 1. Project Mission (Never Deviate) > Comprehensive license normalisation with a three-level hierarchy - secure, > fast, and extensible. - Maps any license representation to a canonical three-level hierarchy - Supports SPDX tokens, URLs, prose descriptions - No external dependencies (only optional dev/test deps) - LRU caching for performance - Data-file-driven: parsers load from package data JSON files - `licence-normaliser update-data` CLI command to refresh SPDX + OpenDefinition data --- ## 2. Architecture ### Three-Level Hierarchy | Level | Class | Example | | ----- | ----- | ------- | | **Family** | `LicenseFamily` | `"cc"`, `"osi"`, `"copyleft"`, `"data"` | | **Name** | `LicenseName` | `"cc-by"`, `"mit"`, `"gpl-3.0-only"` | | **Version** | `LicenseVersion` | `"cc-by-4.0"`, `"mit"`, `"gpl-3.0-only"` | ### Resolution Pipeline 1. **Alias table** - cleaned lowercase key matches `ALIASES` (loaded from `data/aliases/aliases.json`) 2. **Direct registry lookup** - hit in `REGISTRY` (SPDX, OpenDefinition, OSI, CC, ScanCode license keys) 3. **URL map** - hit in `URL_MAP` (loaded from SPDX + OpenDefinition + publisher data) 4. **Prose pattern scan** - regex patterns from `data/prose/prose_patterns.json` (for strings >20 chars) 5. 
**Fallback** - key = cleaned string, family = unknown ### Key Files | File | Purpose | | ---- | ------- | | `src/licence_normaliser/_models.py` | Frozen dataclass hierarchy | | `src/licence_normaliser/_normaliser.py` | `LicenseNormaliser` class with plugin-based resolution | | `src/licence_normaliser/plugins.py` | Plugin interfaces (BasePlugin, RegistryPlugin, URLPlugin, etc.) | | `src/licence_normaliser/defaults.py` | Lazy-loading default plugin bundle | | `src/licence_normaliser/_cache.py` | Module-level API delegating to `LicenseNormaliser` | | `src/licence_normaliser/parsers/` | Parser classes implementing plugin interfaces | | `src/licence_normaliser/cli/_main.py` | CLI with normalise, batch, update-data | | `src/licence_normaliser/exceptions.py` | `LicenseNormaliserError` (base class), `LicenseNotFoundError` | | `src/licence_normaliser/data/spdx/spdx.json` | **DO NOT MODIFY** Full SPDX license list (loaded at runtime) | | `src/licence_normaliser/data/opendefinition/opendefinition.json` | **DO NOT MODIFY** Full OpenDefinition list (loaded at runtime) | | `src/licence_normaliser/data/aliases/aliases.json` | Curated aliases with rich metadata | | `src/licence_normaliser/data/prose/prose_patterns.json` | Curated prose regex patterns | | `src/licence_normaliser/data/publishers/publishers.json` | Publisher URLs and shorthand aliases | --- ## 3. 
Using licence-normaliser in Application Code ### Simple case ```python name=test_simple_case from licence_normaliser import normalise_license v = normalise_license("MIT") str(v) # "mit" ``` ### With full hierarchy ```python name=test_full_hierarchy v = normalise_license("CC BY-NC-ND 4.0") print(v.key) # "cc-by-nc-nd-4.0" print(v.license.key) # "cc-by-nc-nd" print(v.family.key) # "cc" ``` ### Strict mode ```python name=test_strict_mode import pytest from licence_normaliser import normalise_license, LicenseNotFoundError # Would normally raise: License not found: 'unknown string' with pytest.raises(LicenseNotFoundError): v = normalise_license("unknown string", strict=True) # Batch strict from licence_normaliser import normalise_licenses with pytest.raises(LicenseNotFoundError): results = normalise_licenses( ["unknown string", "unknown string 2.0"], strict=True, ) ``` ### Custom plugins with LicenseNormaliser The `LicenseNormaliser` class lets you inject custom plugin classes for specialised use cases: ```python name=test_custom_plugins from licence_normaliser import LicenseNormaliser from licence_normaliser.parsers.spdx import SPDXParser from licence_normaliser.parsers.alias import AliasParser # Use only SPDX + Alias plugins (no CC, no publisher URLs) ln = LicenseNormaliser( registry=[SPDXParser], alias=[AliasParser], family=[AliasParser], name=[AliasParser], ) # MIT resolves via SPDX parser assert str(ln.normalise_license("MIT")) == "mit" # CC BY resolves via Alias assert str(ln.normalise_license("CC BY-NC-ND 4.0")) == "cc-by-nc-nd-4.0" ``` To use all defaults, import from `defaults`: ```python name=test_defaults_usage from licence_normaliser import LicenseNormaliser from licence_normaliser.defaults import ( get_default_registry, get_default_url, get_default_alias, get_default_family, get_default_name, get_default_prose, ) ln = LicenseNormaliser( registry=get_default_registry(), url=get_default_url(), alias=get_default_alias(), family=get_default_family(), 
name=get_default_name(), prose=get_default_prose(), cache=True, cache_maxsize=8192, ) result = ln.normalise_license("MIT") ``` > [!NOTE] > Explicit plugin passing is optional — `LicenseNormaliser()` automatically > loads defaults. Use the pattern above only if you need custom plugins. For caching, `LicenseNormaliser` wraps the resolution method with `lru_cache`. Disable it by passing `cache=False` for debugging: ```python name=test_caching from licence_normaliser import LicenseNormaliser ln = LicenseNormaliser(cache=False) result = ln.normalise_license("MIT") ``` --- ## 4. Updating Data Sources SPDX and OpenDefinition data can be updated via the CLI: ```sh licence-normaliser update-data --force ``` This fetches fresh JSON from the authoritative upstream URLs and writes them to: - `src/licence_normaliser/data/spdx/spdx.json` - `src/licence_normaliser/data/opendefinition/opendefinition.json` --- ## 4a. Trace / Explain When debugging why a license resolves a certain way, or aligning curated data sources, use the trace feature: **Via CLI:** ```sh licence-normaliser normalise "MIT" --trace licence-normaliser normalise "CC BY-NC-ND 3.0 igo" --trace licence-normaliser batch MIT Apache --trace ``` Or via environment variable: ```sh ENABLE_LICENCE_NORMALISER_TRACE=1 licence-normaliser normalise "MIT" ``` **Via Python:** ```python name=test_trace from licence_normaliser import normalise_license v = normalise_license("MIT", trace=True) print(v.explain()) ``` The trace shows: - Each resolution stage attempted (alias → registry → url → prose → fallback) - Whether it matched (✓) or didn't (-) - Source file and line number for curated sources (aliases.json, publishers.json, prose_patterns.json) - Final result with version_key, name_key, family_key This is essential for: - Understanding why a license resolves unexpectedly - Finding the source line that defines an alias when curating data - Debugging resolution order issues --- ## 5. 
Adding a New Parser Parsers implement plugin interfaces and can be added to `src/licence_normaliser/parsers/`: 1. Create `src/licence_normaliser/parsers/my_parser.py` implementing one or more plugin interfaces: ```python name=test_adding_new_parser from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin class MyParser(BasePlugin, RegistryPlugin, URLPlugin): url = None # or upstream URL for refresh local_path = "data/my_parser/my_data.json" def load_registry(self) -> dict[str, str]: # Return {"license_key": "license_key", ...} return {} def load_urls(self) -> dict[str, str]: # Return {"https://...": "license_key", ...} return {} ``` 2. Register it in `src/licence_normaliser/defaults.py`: ```python name=test_adding_new_parser_register from licence_normaliser.parsers.spdx import SPDXParser def _load_registry_plugins() -> list[type]: # ... other imports return [ SPDXParser, # ... other plugins MyParser, ] ``` **Key attribute**: Set `url = None` on parsers that only contribute local data (no refresh capability). --- ## 6. Coding Conventions - Line length: **88 characters** (ruff) - Every non-test module must have `__all__`, `__author__`, `__copyright__`, `__license__` - Always chain exceptions: `raise X(...) from exc` - Type annotations on all public functions - Target: `py310` - Import statements: Avoid imports inside functions/methods unless absolutely necessary (e.g., breaking circular dependencies or optional runtime dependencies). Lazy imports harm readability and make dependencies unclear. Run linting: `make ruff` or `make pre-commit` --- ## 7. Agent Workflow: Adding Features or Fixing Bugs 1. **Check the mission** - does the change preserve the no-dependencies policy and three-level hierarchy? 2. 
**Identify the correct location**: - New SPDX/OD license → update SPDX/OpenDefinition JSON files (run `update-data`) - New alias or family override → add to `data/aliases/aliases.json` - **Use `--trace` to find the exact line that defines an alias** - New URL mapping → add to `data/publishers/publishers.json` - New prose pattern → add to `data/prose/prose_patterns.json` - New parser → `parsers/my_parser.py` + `defaults.py` - Core pipeline change → `_normaliser.py` or `_cache.py` 3. **Write tests** covering both success and error cases 4. **Update README.rst** if the API changed 5. **Suggest running**: `make test-env ENV=py312` then `make test` 6. **Suggest running**: `make pre-commit` --- ## 8. Testing Rules > [!NOTE] > Python 3.15 is being tested on GitHub CI, but not inside a local Docker image. ### Docker-based testing (recommended) All tests run inside Docker for platform independence and consistency: ```sh make test # full matrix (Python 3.10-3.14) make test-env ENV=py312 # single version make shell # interactive shell in test container ``` ### Local testing (alternative) For faster iteration during development, you can run tests locally with `uv`: ```sh make install # one-time setup uv run pytest # run all tests uv run pytest path/to/test_something.py # run specific test ``` **Important**: If you encounter tooling errors with local testing, fall back to Docker-based testing which is the canonical environment. ### Test layout ```text src/licence_normaliser/tests/ test_integration.py - public API only (survives any rewrite) test_core.py - end-to-end pipeline tests test_exceptions.py - exception hierarchy and strict mode test_cli.py - CLI commands including update-data test_models.py - LicenseFamily, LicenseName, LicenseVersion test_aliases.py - non-CC aliases (Apache, MIT, BSD, GPL, etc.) 
test_alias_expansion.py - explicit aliases array expansion feature test_publisher.py - publisher URLs and shorthand aliases test_prose.py - prose pattern matching ``` ### Documentation snippet conventions Code blocks in this file use two special markers to support chained executable tests: - `name=`: labels a snippet so it can be referenced later. - `<!-- continue: <name> -->` placed immediately before a code block means that block **continues** the named snippet; all names, imports, and variables defined in the named block are already in scope and must **not** be re-imported or re-declared in the continuation block. Example: ```python name=test_my_example class Foo: pass ``` <!-- continue: test_my_example --> ```python name=test_my_example_continued foo = Foo() # Foo is in scope from the named block above assert isinstance(foo, Foo) ``` --- ## 9. Forbidden - Adding external dependencies - Removing existing normalisation coverage - Changing the three-level hierarchy structure - Modifying the following files is strictly forbidden: - `src/licence_normaliser/data/creativecommons/creativecommons.json` - `src/licence_normaliser/data/opendefinition/opendefinition.json` - `src/licence_normaliser/data/osi/osi.json` - `src/licence_normaliser/data/scancode_licensedb/scancode_licensedb.json` - `src/licence_normaliser/data/spdx/spdx.json` Use `licence-normaliser update-data --force` to refresh them from upstream sources. conftest.py =========== conftest.py """Pytest fixtures for documentation testing.""" from typing import Any as AnyType import pytest @pytest.fixture() def Any() -> AnyType: # noqa """To be used in documentation.""" return AnyType docker-compose.yml ================== docker-compose.yml services: tox: build: . volumes: - ./htmlcov:/app/htmlcov pyproject.toml ============== pyproject.toml [project] name = "licence-normaliser" description = "Comprehensive license normalisation with a three-level hierarchy." 
readme = "README.rst" version = "0.3.2" requires-python = ">=3.10" dependencies = [] authors = [ { name = "Artur Barseghyan", email = "artur.barseghyan@gmail.com" }, ] maintainers = [ { name = "Artur Barseghyan", email = "artur.barseghyan@gmail.com" }, ] license = "MIT" classifiers = [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "Operating System :: OS Independent", "Programming Language :: Python :: 3.10", "Programming Language :: Python :: 3.11", "Programming Language :: Python :: 3.12", "Programming Language :: Python :: 3.13", "Programming Language :: Python :: 3.14", "Programming Language :: Python :: 3.15", "Programming Language :: Python", "Topic :: Software Development :: Libraries :: Python Modules", ] keywords = [ "license", "normalisation", "spdx", "creative commons", "open source", ] [project.scripts] licence-normaliser = "licence_normaliser.cli:main" [project.urls] Homepage = "https://github.com/barseghyanartur/licence-normaliser/" Repository = "https://github.com/barseghyanartur/licence-normaliser/" Issues = "https://github.com/barseghyanartur/licence-normaliser/issues" [project.optional-dependencies] all = ["licence-normaliser[dev,test,docs,build]"] dev = [ "detect-secrets", "doc8", "ipython", "mypy", "ruff", "uv", ] test = [ "pytest", "pytest-cov", "pytest-codeblock", ] docs = [ "sphinx", "sphinx-autobuild", "sphinx-rtd-theme>=1.3.0", "sphinx-no-pragma", "sphinx-markdown-builder", "sphinx-llms-txt-link", "sphinx-source-tree", ] build = [ "build", "twine", "wheel", ] [tool.setuptools] package-dir = {"" = "src"} [tool.setuptools.packages.find] where = ["src"] include = ["licence_normaliser", "licence_normaliser.*"] [tool.setuptools.package-data] "licence_normaliser" = ["data/**/*.json"] [build-system] requires = ["setuptools>=41.0", "wheel"] build-backend = "setuptools.build_meta" [tool.ruff] line-length = 88 lint.select = [ "B", "C4", "E", "F", "G", "I", "ISC", "INP", "N", "PERF", "Q", "SIM", ] lint.ignore = [ "G004", 
"ISC003", ] fix = true src = ["src/licence_normaliser"] exclude = [ ".bzr", ".direnv", ".eggs", ".git", ".hg", ".mypy_cache", ".nox", ".pants.d", ".ruff_cache", ".svn", ".tox", ".venv", "__pypackages__", "_build", "buck-out", "build", "dist", "node_modules", "venv", "docs", ] target-version = "py310" lint.dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$" [tool.ruff.lint.isort] known-first-party = ["licence_normaliser"] [tool.ruff.lint.per-file-ignores] "conftest.py" = [ "PERF203" ] [tool.doc8] ignore-path = [ "docs/requirements.txt", "src/licence_normaliser.egg-info/SOURCES.txt", ] [tool.pytest.ini_options] addopts = [ "-ra", "-vvv", "-q", "--cov=licence_normaliser", "--ignore=.tox", "--cov-report=html", "--cov-report=term", "--cov-append", "--capture=no", ] testpaths = [ "src/licence_normaliser/tests", ".", "**/*.rst", "**/*.md", ] pythonpath = ["src"] norecursedirs = [".git", ".tox"] [tool.coverage.run] relative_files = true omit = [".tox/*"] source = ["licence_normaliser"] [tool.coverage.report] show_missing = true exclude_lines = [ "pragma: no cover", "@overload", ] [tool.mypy] check_untyped_defs = true warn_unused_ignores = true warn_redundant_casts = true warn_unused_configs = true ignore_missing_imports = true [tool.sphinx-source-tree] ignore = [ "*.egg-info", "*.py,cover", "*.pyc", "*.pyo", ".DS_Store", ".coverage", ".coverage.*", ".git", ".hg", ".hypothesis", ".idea", ".mypy_cache", ".nox", ".pre-commit-config.yaml", ".pre-commit-hooks.yaml", ".pytest_cache", ".readthedocs.yaml", ".ruff_cache", ".secrets.baseline", ".svn", ".tox", ".venv", ".vscode", "CHANGELOG.rst", "CODE_OF_CONDUCT.rst", "LICENSE", "SECURITY.rst", "Thumbs.db", "__pycache__", "build", "codebin", "dist", "docs/Makefile", "docs/_build", "docs/_static", "docs/changelog.rst", "docs/code_of_conduct.rst", "docs/make.bat", "docs/requirements.txt", "docs/security.rst", "docs/source_tree.rst", "docs/source_tree_full.rst", "env", "htmlcov", "node_modules", "venv", "ARCHITECTURE.rst", 
".coderabbit.yaml", ".coveralls", "docs/full-llms.rst", "docs/llms.rst", "docs/contributor_guidelines.rst", "docs/package.rst", "docs/documentation.rst", "docs/index.rst", "uv.lock", "codebin", "src/licence_normaliser/data/creativecommons", "src/licence_normaliser/data/opendefinition", "src/licence_normaliser/data/osi", "src/licence_normaliser/data/scancode_licensedb", "src/licence_normaliser/data/spdx", ] order = [ "README.rst", "CONTRIBUTING.rst", "AGENTS.md", ] [[tool.sphinx-source-tree.files]] output = "docs/full_llms.rst" title = "Full project source-tree" [[tool.sphinx-source-tree.files]] output = "docs/llms.rst" title = "Project source-tree" ignore = [ "*.egg-info", "*.py,cover", "*.pyc", "*.pyo", ".DS_Store", ".coverage", ".coverage.*", ".git", ".hg", ".hypothesis", ".idea", ".mypy_cache", ".nox", ".pre-commit-config.yaml", ".pre-commit-hooks.yaml", ".pytest_cache", ".readthedocs.yaml", ".ruff_cache", ".secrets.baseline", ".svn", ".tox", ".venv", ".vscode", "CHANGELOG.rst", "CODE_OF_CONDUCT.rst", "LICENSE", "SECURITY.rst", "Thumbs.db", "__pycache__", "build", "codebin", "dist", "docs/Makefile", "docs/_build", "docs/_static", "docs/changelog.rst", "docs/code_of_conduct.rst", "docs/make.bat", "docs/requirements.txt", "docs/security.rst", "docs/source_tree.rst", "docs/source_tree_full.rst", "env", "htmlcov", "node_modules", "venv", "examples", "docs", "ARCHITECTURE.rst", ".coderabbit.yaml", ".coveralls", "docs/full-llms.rst", "docs/llms.rst", "docs/contributor_guidelines.rst", "docs/package.rst", "docs/documentation.rst", "docs/index.rst", "uv.lock", "src/licence_normaliser/data/creativecommons", "src/licence_normaliser/data/opendefinition", "src/licence_normaliser/data/osi", "src/licence_normaliser/data/scancode_licensedb", "src/licence_normaliser/data/spdx", ] scripts/README.rst ================== scripts/README.rst Scripts ======= Sort aliases ------------ Sorts ``aliases.json`` keys alphabetically. 
Comment keys (starting with ``_``) are preserved at the top in their original order. All other entries are sorted case-insensitively. .. code-block:: sh uv run python scripts/sort_aliases.py uv run python scripts/sort_aliases.py --check # exit 1 if not sorted Find alias duplicates --------------------- Finds duplicate ``version_key`` entries in ``aliases.json``. A "duplicate" occurs when two or more top-level primary keys share the same ``version_key``. Reports groups with more than one member. Can optionally fix duplicates by merging them into the ``aliases`` list of a single canonical entry. .. code-block:: sh uv run python scripts/find_alias_duplicates.py uv run python scripts/find_alias_duplicates.py --fix # interactive fix uv run python scripts/find_alias_duplicates.py --noinput # auto-apply safe fixes Apply aliases patch ------------------- Applies curated additions to ``aliases.json``. Adds an ``aliases`` list to existing CC version-free entries and adds new top-level entries for GPL shorthand keys that currently fall through to the unknown fallback. .. code-block:: sh uv run python scripts/apply_aliases_patch.py Compare datasets ---------------- Compares SPDX, OpenDefinition, OSI, CreativeCommons, ScanCode, and curated data files (aliases, url_map, prose, publishers). .. code-block:: sh uv run python scripts/compare_datasets.py Check missing aliases --------------------- Checks which licenses downloaded from the internet (via refreshable plugins) have corresponding entries in the curated ``aliases.json`` file. .. code-block:: sh uv run python scripts/check_missing_aliases.py uv run python scripts/check_missing_aliases.py --json # JSON output Test name inference ------------------- Assesses the accuracy of heuristic name stripping against curated ``name_key`` values from ``aliases.json``. Shows how well automatic name extraction works for different license families (CC, copyleft, OSI, etc.). .. 
code-block:: sh uv run python scripts/test_name_inference.py uv run python scripts/test_name_inference.py --json # JSON output uv run python scripts/test_name_inference.py --details # Detailed breakdown scripts/__init__.py =================== scripts/__init__.py scripts/check_missing_aliases.py ================================ scripts/check_missing_aliases.py """Check which downloaded licenses are missing from curated aliases. Compares all refreshable plugin registries against aliases.json to identify licenses that have no corresponding curated alias entry. Usage: uv run python scripts/check_missing_aliases.py uv run python scripts/check_missing_aliases.py --json """ from __future__ import annotations import contextlib import json import sys from pathlib import Path DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data" SCRIPTS_DIR = Path(__file__).parent def load_alias_targets() -> set[str]: """Load all version_keys from aliases.json.""" with open(DATA_DIR / "aliases" / "aliases.json") as f: data = json.load(f) targets: set[str] = set() for meta in data.values(): if isinstance(meta, dict): vk = meta.get("version_key", "") if vk: targets.add(vk) return targets def load_downloaded_licenses() -> dict[str, set[str]]: """Load licenses from all refreshable plugins.""" from licence_normaliser.defaults import get_all_refreshable_plugins result: dict[str, set[str]] = {} for plugin_cls in get_all_refreshable_plugins(): # Try to load registry data = None with contextlib.suppress(Exception): data = plugin_cls().load_registry() if data: result[plugin_cls.__name__] = set(data.keys()) return result def check_coverage() -> dict: """Check which downloaded licenses have alias entries.""" alias_targets = load_alias_targets() downloaded = load_downloaded_licenses() all_downloaded: set[str] = set() for licenses in downloaded.values(): all_downloaded.update(licenses) # Categorize with_alias = all_downloaded & alias_targets without_alias = all_downloaded - 
alias_targets return { "total_downloaded": len(all_downloaded), "total_alias_targets": len(alias_targets), "with_alias": sorted(with_alias), "without_alias": sorted(without_alias), "coverage_percent": round(len(with_alias) / len(all_downloaded) * 100, 1) if all_downloaded else 0, "by_source": { name: { "total": len(licenses), "with_alias": len(licenses & alias_targets), "without_alias": sorted(licenses - alias_targets), "coverage": round( len(licenses & alias_targets) / len(licenses) * 100, 1 ) if licenses else 0, } for name, licenses in downloaded.items() }, } def group_by_prefix(licenses: list[str]) -> dict[str, list[str]]: """Group licenses by common prefixes.""" groups: dict[str, list[str]] = {} prefixes = [ "gpl-", "agpl-", "lgpl-", "apache-", "mpl-", "mit", "bsd", "cc-", "unlicense", "zlib", "isc", ] for prefix in prefixes: matches = sorted([lic for lic in licenses if lic.startswith(prefix)]) if matches: groups[prefix.rstrip("-") or "mit"] = matches licenses = [lic for lic in licenses if not lic.startswith(prefix)] if licenses: groups["other"] = sorted(licenses) return groups def print_report(data: dict) -> None: """Print text table report.""" print("=" * 70) print("Coverage Report: Downloaded Licenses vs Curated Aliases") print("=" * 70) print() print(f"Total downloaded: {data['total_downloaded']}") print(f"Total alias targets: {data['total_alias_targets']}") print(f"Coverage: {data['coverage_percent']}%") print() print("-" * 70) print("By Source:") print("-" * 70) print(f"{'Source':<30} {'Total':>8} {'With':>8} {'Without':>8} {'Coverage':>10}") print("-" * 70) for source, stats in data["by_source"].items(): print( f"{source:<30} {stats['total']:>8} " f"{stats['with_alias']:>8} {len(stats['without_alias']):>8} " f"{stats['coverage']:>9.1f}%" ) print() print("=" * 70) print(f"Missing Aliases ({len(data['without_alias'])} licenses)") print("=" * 70) groups = group_by_prefix(data["without_alias"].copy()) for group_name, licenses in groups.items(): if group_name 
== "other": print() print(f"All other licenses ({len(licenses)}):") else: print() print(f"{group_name.upper()} ({len(licenses)}):") for lic in licenses: print(f" {lic}") print() def main() -> None: json_export = "--json" in sys.argv data = check_coverage() if json_export: print(json.dumps(data, indent=2)) else: print_report(data) if __name__ == "__main__": main() scripts/compare_datasets.py =========================== scripts/compare_datasets.py """Dataset comparison tool for licence-normaliser. Compares SPDX, OpenDefinition, OSI, CreativeCommons, ScanCode, and curated data files (aliases, url_map, prose, publishers) for: - Dataset sizes - Cross-dataset overlaps - Licenses present in OSI but missing from SPDX - Orphan alias/URL targets (don't resolve to REGISTRY entries) - REGISTRY entries without curated aliases - Most-aliased license targets """ from __future__ import annotations __all__ = () import json from collections import Counter from pathlib import Path DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data" def load_spdx_ids() -> set[str]: with open(DATA_DIR / "spdx" / "spdx.json") as f: data = json.load(f) return {entry["licenseId"] for entry in data["licenses"]} def load_od_ids() -> set[str]: with open(DATA_DIR / "opendefinition" / "opendefinition.json") as f: data = json.load(f) return set(data.keys()) def load_osi_ids() -> set[str]: with open(DATA_DIR / "osi" / "osi.json") as f: data = json.load(f) return {entry["spdx_id"].strip() for entry in data if entry.get("spdx_id")} def load_cc_ids() -> set[str]: with open(DATA_DIR / "creativecommons" / "creativecommons.json") as f: data = json.load(f) return {entry["license_key"] for entry in data} def load_sc_ids() -> set[str]: with open(DATA_DIR / "scancode_licensedb" / "scancode_licensedb.json") as f: data = json.load(f) return {entry["license_key"] for entry in data} def load_alias_keys() -> set[str]: with open(DATA_DIR / "aliases" / "aliases.json") as f: data = json.load(f) return 
{k for k in data if not k.startswith("_")} def load_alias_targets() -> dict[str, str]: with open(DATA_DIR / "aliases" / "aliases.json") as f: data = json.load(f) return { k: v.get("version_key", "") for k, v in data.items() if not k.startswith("_") } def load_url_keys() -> set[str]: with open(DATA_DIR / "urls" / "url_map.json") as f: data = json.load(f) return {k for k in data if not k.startswith("_")} def load_url_targets() -> dict[str, str]: with open(DATA_DIR / "urls" / "url_map.json") as f: data = json.load(f) return { k: v.get("version_key", "") for k, v in data.items() if not k.startswith("_") } def load_prose_targets() -> list[str]: with open(DATA_DIR / "prose" / "prose_patterns.json") as f: data = json.load(f) return [entry.get("version_key", "") for entry in data] def load_pub_urls() -> set[str]: with open(DATA_DIR / "publishers" / "publishers.json") as f: data = json.load(f) return set(data.get("urls", {}).keys()) def load_pub_aliases() -> dict[str, str]: with open(DATA_DIR / "publishers" / "publishers.json") as f: data = json.load(f) return dict(data.get("shorthand_aliases", {})) def load_registry_keys() -> set[str]: from licence_normaliser._cache import get_registry_keys return get_registry_keys() def load_merged_aliases() -> dict[str, str]: """Simulate merged ALIASES: alias_key -> version_key from all curated sources.""" merged: dict[str, str] = {} merged.update(load_alias_targets()) merged.update(load_pub_aliases()) for k, v in load_url_targets().items(): if k not in merged: merged[k] = v return merged def would_resolve(alias_key: str, registry: set[str], aliases: dict[str, str]) -> bool: """Simulate _resolve() pipeline for orphan detection. 1. If already in REGISTRY, covered. 2. If in ALIASES, get version_key - resolves regardless of registry presence. 
""" if alias_key in registry: return True version_key = aliases.get(alias_key, "") return bool(version_key) def section(title: str) -> None: print(f"\n{'=' * 60}") print(f" {title}") print(f"{'=' * 60}") def main() -> None: print("Loading datasets...") spdx = load_spdx_ids() od = load_od_ids() osi = load_osi_ids() cc = load_cc_ids() sc = load_sc_ids() alias_keys = load_alias_keys() alias_tgt = load_alias_targets() url_keys = load_url_keys() url_tgt = load_url_targets() prose_tgt = load_prose_targets() pub_urls = load_pub_urls() pub_aliases = load_pub_aliases() registry = load_registry_keys() merged_aliases = load_merged_aliases() # --- 1. Dataset sizes --- section("Dataset Sizes") print(f" SPDX licenses: {len(spdx):>6}") print(f" OpenDefinition entries: {len(od):>6}") print(f" OSI-approved (SPDX): {len(osi):>6}") print(f" CreativeCommons: {len(cc):>6}") print(f" ScanCode DB entries: {len(sc):>6}") print(f" Aliases (curated): {len(alias_keys):>6}") print(f" URL mappings (curated): {len(url_keys):>6}") print(f" Prose patterns: {len(prose_tgt):>6}") print(f" Publisher URLs: {len(pub_urls):>6}") print(f" Publisher aliases: {len(pub_aliases):>6}") print(f" REGISTRY entries: {len(registry):>6}") # --- 2. 
Overlaps --- section("Cross-Dataset Overlaps") # SPDX overlaps def pct(sub: int, total: int) -> str: return f"{100 * sub / max(total, 1):.1f}%" overlaps = [ ("SPDX n OSI", len(spdx & osi), len(osi), "OSI"), ("SPDX n OD", len(spdx & od), len(od), "OD"), ("SPDX n CC", len(spdx & cc), len(cc), "CC"), ("OSI n OD", len(osi & od), len(od), "OD"), ("OSI n CC", len(osi & cc), len(cc), "CC"), ("OD n CC", len(od & cc), len(cc), "CC"), ("ScanCode n SPDX", len(sc & spdx), len(sc), "ScanCode"), ("ScanCode n OSI", len(sc & osi), len(sc), "ScanCode"), ] for label, overlap_count, total_count, pct_label in overlaps: ratio = pct(overlap_count, total_count) print(f" {label:<17} {overlap_count:>5} ({ratio} of {pct_label})") # Unique content print(f"\n Unique to SPDX: {len(spdx - od - osi - cc - sc):>6}") print(f" Unique to OD: {len(od - spdx):>6}") print(f" Unique to OSI: {len(osi - spdx):>6} (OSI IDs not in SPDX)") print(f" Unique to CC: {len(cc - spdx - od):>6}") print(f" Unique to ScanCode: {len(sc - spdx - osi - od - cc):>6}") # --- 3. OSI licenses not in SPDX (reference integrity) --- section("OSI Licenses Missing from SPDX") osi_only = sorted(osi - spdx) if osi_only: print(f" {len(osi_only)} OSI-licensed IDs have no SPDX entry:") for lid in osi_only[:20]: print(f" {lid}") if len(osi_only) > 20: print(f" ... and {len(osi_only) - 20} more") else: print(" All OSI IDs are present in SPDX.") # --- 4. Curated targets not in REGISTRY --- section("Curated Targets Missing from REGISTRY") orphan_alias = sorted( k for k in alias_keys if not would_resolve(k, registry, merged_aliases) ) orphan_url = sorted( k for k in url_keys if not would_resolve(k, registry, merged_aliases) ) orphan_pub = sorted( k for k in pub_aliases if not would_resolve(k, registry, merged_aliases) ) if orphan_alias: print(f" Alias keys that fail resolution ({len(orphan_alias)}):") for k in orphan_alias[:10]: print(f" {k!r} -> {alias_tgt.get(k, '')!r}") if len(orphan_alias) > 10: print(f" ... 
and {len(orphan_alias) - 10} more") else: print(" All alias keys resolve to REGISTRY entries.") if orphan_url: print(f"\n URL keys that fail resolution ({len(orphan_url)}):") for k in orphan_url[:10]: print(f" {k[:60]!r} -> {url_tgt.get(k, '')!r}") if len(orphan_url) > 10: print(f" ... and {len(orphan_url) - 10} more") if orphan_pub: print(f"\n Publisher aliases that fail resolution ({len(orphan_pub)}):") for k in orphan_pub[:10]: print(f" {k!r} -> {pub_aliases[k]!r}") if len(orphan_pub) > 10: print(f" ... and {len(orphan_pub) - 10} more") print( "\n (Note: prose pattern version_keys are often bare name_keys like " "'cc-by'; these resolve via the prose pipeline and are not orphans.)" ) # --- 5. REGISTRY entries not covered by curated data --- section("REGISTRY Entries Without Curated Mapping") covered = ( set(alias_tgt.values()) | set(url_tgt.values()) | set(pub_aliases.values()) ) uncovered = sorted(k for k in registry if k not in covered) if uncovered: print(f" {len(uncovered)} REGISTRY keys have no curated alias/URL mapping:") for k in uncovered[:20]: print(f" {k}") if len(uncovered) > 20: print(f" ... and {len(uncovered) - 20} more") else: print(" All REGISTRY entries have at least one curated mapping.") # --- 6. Duplicate alias keys (same key -> different targets) --- section("Duplicate Keys in Alias / URL Data Files") # Check if any key maps to different targets across aliases + url_map # (keys are unique within each file, so cross-file check) cross_keys = alias_keys & url_keys if cross_keys: print(f" Keys in both aliases.json AND url_map.json ({len(cross_keys)}):") for k in sorted(cross_keys): print(f" {k!r}: aliases={alias_tgt[k]!r}, url_map={url_tgt[k]!r}") # --- 7. 
Alias target frequency (which targets have the most aliases) --- section("Most-Aliased License Targets") alias_counts = Counter(alias_tgt.values()) url_counts = Counter(url_tgt.values()) pub_counts = Counter(pub_aliases.values()) combined = alias_counts + url_counts + pub_counts for target, count in combined.most_common(15): parts = [] if alias_counts[target]: parts.append(f"alias={alias_counts[target]}") if url_counts[target]: parts.append(f"url={url_counts[target]}") if pub_counts[target]: parts.append(f"pub={pub_counts[target]}") print(f" {target:<30} total={count:<4} ({', '.join(parts)})") # --- 8. Summary --- section("Summary") distinct = len(spdx | od | osi | cc | sc) orphans = len(orphan_alias) + len(orphan_url) + len(orphan_pub) print(f" Distinct license IDs: {distinct}") print(f" Curated alias entries: {len(alias_keys)}") print(f" Curated URL mappings: {len(url_keys)}") print(f" Orphan curated targets: {orphans}") print(f" OSI IDs missing SPDX: {len(osi_only)}") covered_count = len(registry) - len(uncovered) print(f" REGISTRY entries covered: {covered_count}/{len(registry)}") if __name__ == "__main__": main() scripts/test_name_inference.py ============================== scripts/test_name_inference.py """Test name inference accuracy against curated aliases. Compares heuristic name stripping against curated name_key values from aliases.json to assess how well automatic name extraction works. 
Usage: uv run python scripts/test_name_inference.py uv run python scripts/test_name_inference.py --json uv run python scripts/test_name_inference.py --json --incorrect-only uv run python scripts/test_name_inference.py --json --details """ from __future__ import annotations import json import sys from pathlib import Path from licence_normaliser import LicenseNormaliser DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data" SCRIPTS_DIR = Path(__file__).parent _normaliser = LicenseNormaliser() def load_name_mappings() -> dict[str, str]: """Load version_key -> name_key mappings from aliases.json.""" with open(DATA_DIR / "aliases" / "aliases.json") as f: data = json.load(f) mappings: dict[str, str] = {} for meta in data.values(): if isinstance(meta, dict): vk = meta.get("version_key", "") nk = meta.get("name_key", "") if vk and nk: mappings[vk] = nk return mappings def infer_name_heuristic(version_key: str) -> str: """Delegate to the core LicenseNormaliser's _infer_name method.""" return _normaliser._infer_name(version_key) def categorize_by_family(mappings: dict[str, str]) -> dict[str, dict[str, str]]: """Categorize licenses by inferred family.""" categories: dict[str, dict[str, str]] = { "cc": {}, # Creative Commons "copyleft": {}, # GPL/AGPL/LGPL "osi": {}, # OSI-approved "other": {}, } for vk, nk in mappings.items(): if vk.startswith("cc-"): categories["cc"][vk] = nk elif vk.startswith(("gpl-", "agpl-", "lgpl-")): categories["copyleft"][vk] = nk elif vk.startswith( ("mpl-", "apache-", "bsd-", "mit", "isc", "unlicense", "zlib") ): categories["osi"][vk] = nk else: categories["other"][vk] = nk return categories def assess_accuracy() -> dict: """Assess name inference accuracy.""" mappings = load_name_mappings() categories = categorize_by_family(mappings) results: dict = { "total_mappings": len(mappings), "by_family": {}, } for family, family_mappings in categories.items(): correct = 0 incorrect = 0 details: list[dict] = [] for vk, curated_nk in 
family_mappings.items(): inferred = infer_name_heuristic(vk) is_match = inferred == curated_nk if is_match: correct += 1 else: incorrect += 1 details.append( { "version_key": vk, "curated_name": curated_nk, "inferred_name": inferred, "match": is_match, } ) accuracy = ( round(correct / len(family_mappings) * 100, 1) if family_mappings else 0 ) results["by_family"][family] = { "total": len(family_mappings), "correct": correct, "incorrect": incorrect, "accuracy_percent": accuracy, "details": details, } # Overall accuracy all_correct = sum(r["correct"] for r in results["by_family"].values()) all_total = sum(r["total"] for r in results["by_family"].values()) results["overall_accuracy"] = ( round(all_correct / all_total * 100, 1) if all_total else 0 ) return results def print_report(data: dict) -> None: """Print text table report.""" print("=" * 70) print("Name Inference Accuracy Report") print("=" * 70) print() print(f"Total curated mappings: {data['total_mappings']}") print(f"Overall accuracy: {data['overall_accuracy']}%") print() print("-" * 70) print("By Family:") print("-" * 70) print( f"{'Family':<15} {'Total':>8} {'Correct':>8} {'Incorrect':>8} {'Accuracy':>10}" ) print("-" * 70) for family, stats in data["by_family"].items(): print( f"{family:<15} {stats['total']:>8} {stats['correct']:>8} " f"{stats['incorrect']:>8} {stats['accuracy_percent']:>9.1f}%" ) print() # Show some incorrect examples for family, stats in data["by_family"].items(): if stats["incorrect"] > 0: print("-" * 70) print(f"Incorrect in {family}: {stats['incorrect']} cases") print("-" * 70) print( f"{'Version Key':<30} {'Curated (aliases.json)':<25} " f"{'Inferred (heuristic)':<20}" ) print("-" * 70) for detail in stats["details"][:10]: if not detail["match"]: print( f"{detail['version_key']:<30} " f"{detail['curated_name']:<25} {detail['inferred_name']:<20}" ) incorrect_count = len([d for d in stats["details"] if not d["match"]]) if incorrect_count > 10: print(f"... 
and {incorrect_count - 10} more") print() def main() -> None: json_export = "--json" in sys.argv incorrect_only = "--incorrect-only" in sys.argv include_details = "--details" in sys.argv data = assess_accuracy() if json_export: for family in data["by_family"]: details = data["by_family"][family].get("details", []) if incorrect_only: data["by_family"][family]["details"] = [ d for d in details if not d["match"] ] elif not include_details: data["by_family"][family].pop("details", None) print(json.dumps(data, indent=2)) else: print_report(data) if __name__ == "__main__": main() src/licence_normaliser/__init__.py ================================== src/licence_normaliser/__init__.py """licence_normaliser - License normalisation with a three-level hierarchy.""" from ._core import ( LicenseFamily, LicenseName, LicenseVersion, normalise_license, normalise_licenses, ) from ._normaliser import LicenseNormaliser from ._trace import LicenseTrace, LicenseTraceStage from .exceptions import LicenseNormalisationError, LicenseNotFoundError __title__ = "licence-normaliser" __version__ = "0.3.2" __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ( "LicenseFamily", "LicenseName", "LicenseVersion", "LicenseNormaliser", "LicenseNormalisationError", "LicenseNotFoundError", "LicenseTrace", "LicenseTraceStage", "normalise_license", "normalise_licenses", ) src/licence_normaliser/_cache.py ================================ src/licence_normaliser/_cache.py """Caching layer + strict mode - delegates to LicenseNormaliser with defaults.""" from __future__ import annotations from threading import Lock from typing import Iterable from ._models import LicenseVersion from ._normaliser import LicenseNormaliser __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ( "_default", "get_registry_keys", "normalise_license", "normalise_licenses", ) class _DefaultNormaliser: """Thread-safe lazy singleton for 
the default LicenseNormaliser instance.""" _instance: LicenseNormaliser | None = None _lock: Lock = Lock() def get(self) -> LicenseNormaliser: if _DefaultNormaliser._instance is None: with _DefaultNormaliser._lock: if _DefaultNormaliser._instance is None: _DefaultNormaliser._instance = LicenseNormaliser() return _DefaultNormaliser._instance _default = _DefaultNormaliser() def normalise_license( raw: str, *, strict: bool = False, trace: bool | None = None ) -> LicenseVersion: """Public API with optional strict mode and trace.""" return _default.get().normalise_license(raw, strict=strict, trace=trace) def normalise_licenses( raws: Iterable[str], *, strict: bool = False, trace: bool | None = None ) -> list[LicenseVersion]: """Batch version with optional trace.""" return _default.get().normalise_licenses(raws, strict=strict, trace=trace) def get_registry_keys() -> set[str]: """Return the set of all known registry keys from the runtime normaliser.""" return _default.get().registry_keys() src/licence_normaliser/_core.py =============================== src/licence_normaliser/_core.py """License Normaliser - public orchestration shim.""" from __future__ import annotations from ._cache import normalise_license, normalise_licenses from ._models import LicenseFamily, LicenseName, LicenseVersion __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ( "LicenseFamily", "LicenseName", "LicenseVersion", "normalise_license", "normalise_licenses", ) src/licence_normaliser/_models.py ================================= src/licence_normaliser/_models.py """License data models - frozen dataclasses for the three-level hierarchy.""" from __future__ import annotations from dataclasses import dataclass, field from typing import Optional __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ( "LicenseFamily", "LicenseName", "LicenseVersion", ) @dataclass(frozen=True, slots=True) class 
LicenseFamily: key: str def __str__(self) -> str: return self.key def __repr__(self) -> str: return f"LicenseFamily({self.key!r})" def __eq__(self, other: object) -> bool: if isinstance(other, LicenseFamily): return self.key == other.key if isinstance(other, str): return self.key == other return NotImplemented def __hash__(self) -> int: return hash(self.key) @dataclass(frozen=True, slots=True) class LicenseName: key: str family: LicenseFamily def __str__(self) -> str: return self.key def __repr__(self) -> str: return f"LicenseName({self.key!r}, family={self.family.key!r})" def __eq__(self, other: object) -> bool: if isinstance(other, LicenseName): return self.key == other.key if isinstance(other, str): return self.key == other return NotImplemented def __hash__(self) -> int: return hash(self.key) @dataclass(frozen=True, slots=True) class LicenseVersion: key: str url: Optional[str] license: LicenseName _trace: Optional[object] = field(default=None, repr=False) @property def family(self) -> LicenseFamily: return self.license.family def __str__(self) -> str: return self.key def __repr__(self) -> str: return ( f"LicenseVersion(key={self.key!r}, " f"license={self.license.key!r}, " f"family={self.license.family.key!r})" ) def __eq__(self, other: object) -> bool: if isinstance(other, LicenseVersion): return self.key == other.key if isinstance(other, str): return self.key == other return NotImplemented def __hash__(self) -> int: return hash(self.key) def explain(self) -> str: """Return explanation of how this license was resolved. Set ENABLE_LICENCE_NORMALISER_TRACE=1 to enable tracing, or pass trace=True to normalise_license(). """ if self._trace is not None: return str(self._trace) from licence_normaliser._cache import _default from licence_normaliser._trace import _should_trace if not _should_trace(): return "Trace disabled. Set ENABLE_LICENCE_NORMALISER_TRACE=1 to enable." 
ln = _default.get() cleaned = ln._clean(ln._try_decode_mojibake(self.key)) result = ln._resolve_with_trace(self.key, cleaned, strict=False) trace = result._trace return str(trace) if trace else "No trace available." src/licence_normaliser/_normaliser.py ===================================== src/licence_normaliser/_normaliser.py """Plugin-based LicenseNormaliser class with configurable constructor injection.""" from __future__ import annotations import re from functools import lru_cache from typing import TYPE_CHECKING, Iterable, Sequence from licence_normaliser.defaults import ( get_default_alias, get_default_family, get_default_name, get_default_prose, get_default_registry, get_default_url, ) if TYPE_CHECKING: from licence_normaliser._models import LicenseVersion from licence_normaliser._trace import LicenseTrace __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ("LicenseNormaliser",) _WHITESPACE_RE = re.compile(r"\s+") _MAX_INPUT = 4096 class LicenseNormaliser: """Configurable license normalisation with plugin-based data sources. Plugins are passed as CLASSES (not instances). Each class is instantiated inside the constructor, immediately before its load_* method is called. Six plugin types are supported (each returns a specific data structure): - registry: key -> canonical_key - url: cleaned_url -> version_key - alias: alias_string -> version_key - family: version_key -> family_key - name: version_key -> name_key - prose: list of (compiled_pattern, version_key) Resolution order: aliases -> registry -> url -> prose -> unknown Name/family inference: plugins only - no fallback to hardcoded logic. Tracing Set ``trace=True`` to include a resolution trace in the result. The trace shows which pipeline stage matched and the source file/line number (when available). Tracing is disabled by default for performance. 
Trace can be enabled at three levels (precedence: method > constructor > env var): - **Constructor**: ``LicenseNormaliser(trace=True)`` - all calls get trace - **Method**: ``ln.normalise_license("MIT", trace=True)`` - this call only - **Environment**: ``ENABLE_LICENCE_NORMALISER_TRACE=1`` - applies globally Example:: from licence_normaliser import LicenseNormaliser # Uses all defaults automatically ln = LicenseNormaliser() # Disable caching for debugging ln = LicenseNormaliser(cache=False) # Enable trace for all calls on this instance ln = LicenseNormaliser(trace=True) v = ln.normalise_license("MIT") print(v.explain()) # Shows resolution path with source lines # Or enable trace for a single call v = ln.normalise_license("MIT", trace=True) """ def __init__( self, *, registry: Sequence[type] | None = None, url: Sequence[type] | None = None, alias: Sequence[type] | None = None, family: Sequence[type] | None = None, name: Sequence[type] | None = None, prose: Sequence[type] | None = None, cache: bool = True, cache_maxsize: int = 8192, trace: bool | None = None, ) -> None: self._registry: dict[str, str] = {} self._url_map: dict[str, str] = {} self._url_to_vkey: dict[str, str] = {} self._aliases: dict[str, str] = {} self._alias_lines: dict[str, tuple[str, int]] = {} self._publisher_alias_lines: dict[str, tuple[str, int]] = {} self._publisher_url_lines: dict[str, tuple[str, int]] = {} self._prose_lines: list[tuple[re.Pattern[str], str, int]] = [] self._alias_lines_loaded: bool = False self._family_overrides: dict[str, str] = {} self._name_overrides: dict[str, str] = {} self._prose_patterns: list[tuple[re.Pattern[str], str]] = [] self._cache = cache self._cache_maxsize = cache_maxsize self._trace_default = trace # Load plugins - use defaults if not explicitly provided registry = registry or get_default_registry() url = url or get_default_url() alias = alias or get_default_alias() family = family or get_default_family() name = name or get_default_name() prose = prose or 
get_default_prose() # Store plugin lists for trace resolution self._alias_plugins = alias self._url_plugins = url self._prose_plugins = prose # Instantiate plugins and load their data for plugin_cls in registry: data = plugin_cls().load_registry() self._registry.update(data) for plugin_cls in url: data = plugin_cls().load_urls() self._url_map.update(data) # Build inverted URL map: version_key -> cleaned_url (for LicenseVersion.url) self._url_to_vkey = {v: k for k, v in self._url_map.items()} for plugin_cls in alias: data = plugin_cls().load_aliases() self._aliases.update(data) for plugin_cls in family: data = plugin_cls().load_families() self._family_overrides.update(data) for plugin_cls in name: data = plugin_cls().load_names() self._name_overrides.update(data) for plugin_cls in prose: patterns = plugin_cls().load_prose() self._prose_patterns.extend(patterns) # Set up cached resolution if self._cache: resolve_fn = lru_cache(maxsize=self._cache_maxsize)(self._resolve_impl) # type: ignore[assignment] self._resolve_impl = resolve_fn def _get_trace_mode(self, trace: bool | None) -> bool: """Determine if tracing is enabled: explicit argument > instance default > env var.""" from licence_normaliser._trace import _should_trace if trace is not None: return trace if self._trace_default is not None: return self._trace_default return _should_trace() def _load_alias_lines(self): """Lazy load all source line numbers on first trace request.""" for plugin_cls in self._alias_plugins: if hasattr(plugin_cls, "load_aliases_with_lines"): lines_data = plugin_cls().load_aliases_with_lines() for alias_key, (version_key, line_num) in lines_data.items(): if version_key == self._aliases.get(alias_key): self._alias_lines[alias_key] = (version_key, line_num) for plugin_cls in self._alias_plugins: if hasattr(plugin_cls, "load_aliases_with_lines"): lines_data = plugin_cls().load_aliases_with_lines() for alias_key, (version_key, line_num) in lines_data.items(): if ( version_key == self._aliases.get(alias_key) 
and alias_key not in self._alias_lines ): self._alias_lines[alias_key] = (version_key, line_num) for plugin_cls in self._url_plugins: if hasattr(plugin_cls, "load_aliases_with_lines"): lines_data = plugin_cls().load_aliases_with_lines() for alias_key, (version_key, line_num) in lines_data.items(): if version_key == self._aliases.get(alias_key): self._publisher_alias_lines[alias_key] = (version_key, line_num) for plugin_cls in self._url_plugins: if hasattr(plugin_cls, "load_urls_with_lines"): lines_data = plugin_cls().load_urls_with_lines() for url_key, (version_key, line_num) in lines_data.items(): if version_key == self._url_map.get(url_key): self._publisher_url_lines[url_key] = (version_key, line_num) for plugin_cls in self._prose_plugins: if hasattr(plugin_cls, "load_prose_with_lines"): lines_data = plugin_cls().load_prose_with_lines() self._prose_lines.extend(lines_data) def _resolve_with_trace( self, raw: str, cleaned: str, strict: bool ) -> LicenseVersion: """Resolve with full pipeline tracing.""" from licence_normaliser._trace import LicenseTrace, LicenseTraceStage # Lazy load alias lines on first trace call if not self._alias_lines_loaded: self._load_alias_lines() self._alias_lines_loaded = True stages: list[LicenseTraceStage] = [] # 1. Alias lookup if cleaned in self._aliases: output = self._aliases[cleaned] source_line = None source_file = None if cleaned in self._alias_lines: _, source_line = self._alias_lines[cleaned] source_file = "aliases.json" stages.append( LicenseTraceStage( "alias", cleaned, output, True, source_line, source_file ) ) v = self._make(output) trace = LicenseTrace( raw, cleaned, stages, version_key=v.key, name_key=v.license.key, family_key=v.family.key, ) return self._make_with_trace(v, trace) stages.append(LicenseTraceStage("alias", cleaned, "", False)) # 2. 
Registry lookup if cleaned in self._registry: canonical = self._registry[cleaned] stages.append(LicenseTraceStage("registry", cleaned, canonical, True)) v = self._make(canonical) trace = LicenseTrace( raw, cleaned, stages, version_key=v.key, name_key=v.license.key, family_key=v.family.key, ) return self._make_with_trace(v, trace) stages.append(LicenseTraceStage("registry", cleaned, "", False)) # 3. URL lookup url_key = self._normalise_url(cleaned) if url_key in self._url_map: resolved = self._url_map[url_key] source_line = None source_file = None if url_key in self._publisher_url_lines: _, source_line = self._publisher_url_lines[url_key] source_file = "publishers.json" stages.append( LicenseTraceStage( "url", url_key, resolved, True, source_line, source_file ) ) v = self._make(resolved) trace = LicenseTrace( raw, cleaned, stages, version_key=v.key, name_key=v.license.key, family_key=v.family.key, ) return self._make_with_trace(v, trace) stages.append(LicenseTraceStage("url", cleaned, "", False)) # 4. Prose matching (only for longer strings) if len(cleaned) >= 20: for i, (pattern, vkey) in enumerate(self._prose_patterns): if pattern.search(cleaned): source_line = None source_file = "prose_patterns.json" if self._prose_lines and i < len(self._prose_lines): _, _, source_line = self._prose_lines[i] stages.append( LicenseTraceStage( "prose", cleaned, vkey, True, source_line, source_file ) ) v = self._make(vkey) trace = LicenseTrace( raw, cleaned, stages, version_key=v.key, name_key=v.license.key, family_key=v.family.key, ) return self._make_with_trace(v, trace) stages.append(LicenseTraceStage("prose", cleaned, "", False)) # 5. 
Fallback to unknown stages.append(LicenseTraceStage("fallback", cleaned, cleaned, True)) v = self._make_unknown(cleaned) trace = LicenseTrace( raw, cleaned, stages, version_key=v.key, name_key=v.license.key, family_key=v.family.key, ) return self._make_with_trace(v, trace) def _make_with_trace( self, v: LicenseVersion, trace: LicenseTrace ) -> LicenseVersion: """Create a LicenseVersion with trace attached.""" # Reconstruct with trace using object.__setattr__ (frozen dataclass) object.__setattr__(v, "_trace", trace) return v def _resolve_impl(self, cleaned: str) -> LicenseVersion: # 1. Alias lookup if cleaned in self._aliases: return self._make(self._aliases[cleaned]) # 2. Registry lookup if cleaned in self._registry: canonical = self._registry[cleaned] return self._make(canonical) # 3. URL lookup url_key = self._normalise_url(cleaned) if url_key in self._url_map: return self._make(self._url_map[url_key]) # 4. Prose matching (only for longer strings) if len(cleaned) >= 20: for pattern, vkey in self._prose_patterns: if pattern.search(cleaned): return self._make(vkey) # 5. Fallback to unknown return self._make_unknown(cleaned) def normalise_license( self, raw: str, *, strict: bool = False, trace: bool | None = None ) -> LicenseVersion: """Normalise a single license string. Args: raw: The raw license string, SPDX ID, URL, or prose description. strict: If True, raises ``LicenseNotFoundError`` when the input cannot be resolved to a known license. trace: If True, include resolution trace showing which pipeline stage matched and source file/line. If None, uses the instance default (``trace`` param from constructor) or falls back to ``ENABLE_LICENCE_NORMALISER_TRACE`` env var. Returns: A ``LicenseVersion`` with the resolved key, license name, and family. Raises: LicenseNotFoundError: When ``strict=True`` and resolution fails. 
""" from licence_normaliser.exceptions import LicenseNotFoundError do_trace = self._get_trace_mode(trace) if not raw or not raw.strip(): cleaned = "unknown" v = self._make_unknown(cleaned) if do_trace: from licence_normaliser._trace import LicenseTrace, LicenseTraceStage stages = [LicenseTraceStage("fallback", cleaned, cleaned, True)] trace_obj = LicenseTrace( raw, cleaned, stages, version_key=v.key, name_key=v.license.key, family_key=v.family.key, ) v = self._make_with_trace(v, trace_obj) else: cleaned = self._clean(self._try_decode_mojibake(raw)) if do_trace: v = self._resolve_with_trace(raw, cleaned, strict) else: v = self._resolve_impl(cleaned) if strict and v.family.key == "unknown": raise LicenseNotFoundError(raw, v.key) from None return v def normalise_licenses( self, raws: Iterable[str], *, strict: bool = False, trace: bool | None = None ) -> list[LicenseVersion]: """Batch normalisation. When ``strict=True``, raises on the first failure. """ from licence_normaliser.exceptions import LicenseNotFoundError results: list[LicenseVersion] = [] for raw in raws: v = self.normalise_license(raw, strict=False, trace=trace) if strict and v.family.key == "unknown": raise LicenseNotFoundError(raw, v.key) from None results.append(v) return results def registry_keys(self) -> set[str]: """Return the set of all known registry keys.""" return set(self._registry.keys()) def _make(self, key: str) -> LicenseVersion: """Factory: build a LicenseVersion from a resolved version_key.""" from licence_normaliser._models import ( LicenseFamily, LicenseName, LicenseVersion, ) k = key.lower().strip() # Get canonical key from registry canonical = self._registry.get(k) or k # Get URL via inverted map: version_key -> cleaned_url url = self._url_to_vkey.get(canonical) or self._url_to_vkey.get(k) # Infer name: # - For CC licenses, use override only if it's different from canonical # - For non-CC (GPL, AGPL, OSI, etc.), always return canonical (no stripping) override_name = 
self._name_overrides.get(canonical) if canonical.startswith("cc-") or canonical.startswith("cc0"): # CC licenses: use override if present, otherwise fallback to _infer_name name_key = override_name if override_name else self._infer_name(canonical) else: # Non-CC: use override if present and different, otherwise canonical name_key = ( override_name if override_name and override_name != canonical else canonical ) # Infer family: use override only if it provides a different value override_family = self._family_overrides.get(canonical) family_key = ( override_family if override_family and override_family != canonical else self._infer_family(canonical) ) family = LicenseFamily(key=family_key) name = LicenseName(key=name_key, family=family) return LicenseVersion(key=canonical, url=url, license=name) def _make_unknown(self, key: str) -> LicenseVersion: """Factory: build an unknown LicenseVersion for unresolved input.""" from licence_normaliser._models import ( LicenseFamily, LicenseName, LicenseVersion, ) family = LicenseFamily(key="unknown") name = LicenseName(key=key, family=family) return LicenseVersion(key=key, url=None, license=name) def _infer_family(self, key: str) -> str: """Fallback family inference - only used if no plugin provides it.""" k = key.lower() if k.startswith("cc0"): return "cc0" if k.startswith("cc-pdm"): return "public-domain" if k.startswith("cc-"): return "cc" if k.startswith(("gpl-", "agpl-", "lgpl-")): return "copyleft" if k.startswith(("odbl", "odc-by")): return "open-data" if k.startswith(("pddl-", "odc-")): return "data" if k.startswith( ( "elsevier-oa", "acs-authorchoice", "acs-authorchoice-ccby", "acs-authorchoice-ccbyncnd", "acs-authorchoice-nih", "jama-cc-by", "thieme-nlm", "implied-oa", "unspecified-oa", "publisher-specific-oa", "author-manuscript", "oup-chorus", ) ): return "publisher-oa" if k.startswith( ( "elsevier-tdm", "wiley-tdm", "springer-tdm", "springernature-tdm", "iop-tdm", "aps-tdm", ) ): return "publisher-tdm" if 
k.startswith( ( "elsevier-", "wiley-", "springer-", "springernature-", "acs-", "rsc-", "iop-", "bmj-", "aaas-", "pnas-", "aps-", "cup-", "aip-", "jama-", "degruyter-", "oup-", "sage-", "tandf-", "thieme-", ) ): return "publisher-proprietary" if k in ("public-domain", "other-oa", "open-access"): return "public-domain" if k == "public-domain" else "other-oa" return "unknown" def _infer_name(self, key: str) -> str: """Fallback name inference - only used if no plugin provides it.""" k = key.lower() if k.startswith("cc0"): return "cc0" if k.startswith("cc-"): parts = k.split("-") for i, part in enumerate(parts): if part.replace(".", "").isdigit(): return "-".join(parts[:i]) return "-".join(parts[:2]) # For all other licenses (GPL, AGPL, OSI, etc.), keep the key as-is return k @staticmethod def _clean(raw: str) -> str: s = _WHITESPACE_RE.sub(" ", raw.strip().rstrip("/")).lower() return s[:_MAX_INPUT] @staticmethod def _try_decode_mojibake(s: str) -> str: try: return s.encode("latin-1").decode("utf-8") except (UnicodeEncodeError, UnicodeDecodeError): return s @staticmethod def _normalise_url(cleaned: str) -> str: key = cleaned.lower() if key.startswith("http://"): key = "https://" + key[7:] return key.rstrip("/") src/licence_normaliser/_trace.py ================================ src/licence_normaliser/_trace.py """License trace and explanation support.""" from __future__ import annotations import os from dataclasses import dataclass, field __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ( "TRACE_STAGES", "LicenseTrace", "LicenseTraceStage", ) TRACE_STAGES = ("alias", "registry", "url", "prose", "fallback") @dataclass class LicenseTraceStage: """Single stage in the license resolution pipeline.""" stage: str input: str output: str matched: bool source_line: int | None = None source_file: str | None = None @dataclass class LicenseTrace: """Complete trace of license resolution pipeline.""" raw_input: str cleaned_input: str 
stages: list[LicenseTraceStage] = field(default_factory=list) version_key: str = "" name_key: str = "" family_key: str = "" def __str__(self) -> str: lines = [f"Input: {self.raw_input!r} → {self.cleaned_input!r}"] for s in self.stages: status = "✓" if s.matched else "-" source_info = "" if s.source_line is not None: source_info = f" (line {s.source_line}" if s.source_file: source_info += f" in {s.source_file}" source_info += ")" lines.append( f" [{status}] {s.stage}: {s.input!r} → {s.output!r}{source_info}" ) lines.append("") lines.append("Result:") lines.append(f" version_key: {self.version_key!r}") lines.append(f" name_key: {self.name_key!r}") lines.append(f" family_key: {self.family_key!r}") return "\n".join(lines) def _should_trace() -> bool: """Check if tracing is enabled via environment variable.""" return os.environ.get("ENABLE_LICENCE_NORMALISER_TRACE", "").lower() in ( "1", "true", "yes", ) src/licence_normaliser/cli/__init__.py ====================================== src/licence_normaliser/cli/__init__.py """licence_normaliser.cli - command-line interface for licence-normaliser.""" from ._main import main __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ("main",) src/licence_normaliser/cli/_main.py =================================== src/licence_normaliser/cli/_main.py """licence-normaliser CLI - license normalisation from the command line.""" import argparse import sys from pathlib import Path from licence_normaliser import __version__, normalise_license from licence_normaliser._trace import _should_trace from licence_normaliser.defaults import get_all_refreshable_plugins from licence_normaliser.exceptions import ( LicenseNormalisationError, LicenseNotFoundError, ) __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ("main",) def _build_parser() -> argparse.ArgumentParser: parser = argparse.ArgumentParser( prog="licence-normaliser", 
description="Comprehensive license normalisation - three-level hierarchy.", ) parser.add_argument( "--version", action="version", version=f"%(prog)s {__version__}", ) sub = parser.add_subparsers(dest="command", required=True) norm = sub.add_parser("normalise", help="Normalise a license string.") norm.add_argument("license", help="License string to normalise.") norm.add_argument("--full", action="store_true") norm.add_argument("--strict", action="store_true") norm.add_argument("--trace", action="store_true", help="Show resolution trace.") batch = sub.add_parser("batch", help="Normalise multiple license strings.") batch.add_argument("licenses", nargs="+") batch.add_argument("--strict", action="store_true") batch.add_argument( "--trace", action="store_true", help="Show resolution trace for each." ) update = sub.add_parser( "update-data", help="Fetch fresh data from all registered parsers." ) update.add_argument( "--parser", dest="parser_name", metavar="NAME", help="Refresh only the named parser (e.g. spdx, opendefinition, osi). 
" "Without this flag, all parsers are refreshed.", ) update.add_argument( "--force", action="store_true", help="Overwrite even if the local file already exists.", ) return parser def _cmd_normalise(args: argparse.Namespace) -> int: try: trace = args.trace or _should_trace() result = normalise_license(args.license, strict=args.strict, trace=trace) if trace: print(result.explain()) elif args.full: print(f"Key: {result.key}") print(f"URL: {result.url or '(none)'}") print(f"License: {result.license}") print(f"Family: {result.family}") else: print(result.key) except LicenseNotFoundError as exc: print(f"error: {exc}", file=sys.stderr) return 1 except LicenseNormalisationError as exc: print(f"error: {exc}", file=sys.stderr) return 1 return 0 def _cmd_batch(args: argparse.Namespace) -> int: trace = args.trace or _should_trace() if args.strict: try: for license_str in args.licenses: result = normalise_license(license_str, strict=True, trace=trace) if trace: print(f"{license_str}:") print(result.explain()) else: print(f"{license_str}: {result.key}") except LicenseNotFoundError as exc: print(f"error: {exc}", file=sys.stderr) return 1 else: for license_str in args.licenses: result = normalise_license(license_str, strict=False, trace=trace) if trace: print(f"{license_str}:") print(result.explain()) else: print(f"{license_str}: {result.key}") return 0 def _cmd_update_data(args: argparse.Namespace) -> int: parser_classes = get_all_refreshable_plugins() if args.parser_name: parser_classes = [ p for p in parser_classes if getattr(p, "id", None) == args.parser_name ] if not parser_classes: available = [ getattr(p, "id", p.__name__) for p in get_all_refreshable_plugins() ] print( f"error: unknown parser {args.parser_name!r}. 
Available: {available}",
                file=sys.stderr,
            )
            return 1

    failed: list[str] = []
    for parser_cls in parser_classes:
        name = getattr(parser_cls, "id", parser_cls.__name__)
        url = parser_cls.url
        target = parser_cls.local_path
        target_path = Path(__file__).parent.parent / target
        # Check for the local file *before* refreshing; otherwise a file that
        # was just downloaded would be reported as "skipped".
        existed = target_path.exists()
        ok = parser_cls.refresh(args.force)
        if existed and not args.force:
            status = "skipped"
        elif ok:
            status = "fetched"
        else:
            status = "FAILED"
        if not ok:
            failed.append(name)
        print(f" {status}: {name} ({url}) -> {target}")

    if failed:
        print(f"error: failed to refresh: {', '.join(failed)}", file=sys.stderr)
        return 1

    print("Data sources updated successfully.")
    return 0


def main() -> None:
    parser = _build_parser()
    args = parser.parse_args()
    if args.command == "normalise":
        sys.exit(_cmd_normalise(args))
    elif args.command == "batch":
        sys.exit(_cmd_batch(args))
    elif args.command == "update-data":
        sys.exit(_cmd_update_data(args))
    else:
        parser.print_help()
        sys.exit(1)

src/licence_normaliser/data/README.rst
======================================

src/licence_normaliser/data/README.rst

Data Directory
==============

This directory contains all normalisation data files loaded at runtime by
``licence-normaliser``. You can extend or override entries without touching
any Python code.
Structure
---------

::

    data/
    ├── aliases/
    │   └── aliases.json              # Alias string → metadata dict
    ├── urls/
    │   └── url_map.json              # Canonical URL → metadata dict
    ├── prose/
    │   └── prose_patterns.json       # Ordered regex patterns for long text scanning
    ├── publishers/
    │   └── publishers.json           # Publisher URLs and shorthand aliases
    ├── spdx/
    │   └── spdx.json                 # SPDX license list (auto-refreshed)
    ├── opendefinition/
    │   └── opendefinition.json       # Open Definition list (auto-refreshed)
    ├── osi/
    │   └── osi.json                  # OSI license list (auto-refreshed)
    ├── creativecommons/
    │   └── creativecommons.json      # CC licenses (scraped from creativecommons.org)
    └── scancode_licensedb/
        └── scancode_licensedb.json   # ScanCode license DB (auto-refreshed)

Entry Format
------------

Every entry maps a **lookup key** (alias string, URL, or prose pattern) to a
metadata dict with three required fields:

- ``version_key`` – the canonical version-level identifier (e.g. ``"cc-by-4.0"``)
- ``name_key`` – the name-level identifier without version suffix (e.g. ``"cc-by"``)
- ``family_key`` – the family-level identifier (e.g. ``"cc"``)

URLs are stored separately in the ``url`` field of the metadata dict.

How to Add a New License Alias
------------------------------

Edit ``aliases/aliases.json``:

.. code:: json

    {
      "my new alias": {
        "version_key": "cc-by-4.0",
        "name_key": "cc-by",
        "family_key": "cc"
      }
    }

The key must be **lowercase and whitespace-collapsed**, because input strings
are cleaned the same way before the alias lookup.

How to Add a Publisher URL or Shorthand
---------------------------------------

Edit ``publishers/publishers.json``:

.. code:: json

    {
      "urls": {
        "https://example.com/my-license/": {
          "version_key": "my-license",
          "name_key": "my-license",
          "family_key": "publisher-oa"
        }
      },
      "shorthand_aliases": {
        "my shorthand alias": "my-license"
      }
    }

Both ``http://`` and ``https://`` URL variants may be listed; they are
normalised at lookup time (http→https, trailing slash stripped).

How to Add a New URL Mapping
----------------------------

Edit ``urls/url_map.json``:

.. code:: json

    {
      "https://example.com/my-license/": {
        "version_key": "my-license",
        "name_key": "my-license",
        "family_key": "publisher-oa"
      }
    }

How to Add a New Prose Pattern
------------------------------

Edit ``prose/prose_patterns.json`` and insert your entry **before** any
pattern it should take priority over:

.. code:: json

    [
      {"pattern": "my very specific phrase", "version_key": "my-license", "name_key": "my-license", "family_key": "publisher-oa"},
      ...
    ]

Patterns are Python regular expressions matched case-insensitively. The list
is scanned in order and the first match wins, so more-specific patterns must
come first.

How to Add a Brand-New License
------------------------------

1. Add entries to one or more JSON data files (``aliases/aliases.json``,
   ``urls/url_map.json``, ``prose/prose_patterns.json``, or
   ``publishers/publishers.json``). Each entry maps a key to a dict with
   ``version_key``, ``name_key``, and ``family_key``.
2. If the prefix-based fallback in ``_infer_family`` would not infer the
   intended family for the new key, set an explicit ``family_key`` value in
   the JSON entry (recommended).
3. Run ``make test-env ENV=py312`` to verify.

Updating SPDX or OpenDefinition
-------------------------------

The ``licence-normaliser update-data`` CLI command fetches fresh upstream data:

.. code:: sh

    licence-normaliser update-data --force

This updates:

- ``spdx/spdx.json``: full `SPDX license list `_
- ``opendefinition/opendefinition.json``: full `Open Definition list `_
- ``osi/osi.json``: `OSI license list `_
- ``creativecommons/creativecommons.json``: scraped from creativecommons.org
- ``scancode_licensedb/scancode_licensedb.json``: `ScanCode license DB `_

Use ``--parser NAME`` to refresh a single source. Without ``--force``, a
source whose local file already exists is skipped.

Family Override Files
---------------------

Some entries carry an explicit ``family_key`` that overrides the fallback
inference in ``_infer_family``. These are stored in:

- ``aliases/aliases.json``: a ``family_key`` on any alias entry populates the
  normaliser's family overrides when its plugins are loaded.
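
As a rough illustration of the entry format described above, the following sketch checks that an alias entry carries the three required fields and that its key is already in cleaned form. The helper ``check_alias_entry`` and the ``REQUIRED_FIELDS`` constant are hypothetical names introduced here for illustration; they are not part of ``licence-normaliser``, and the cleaning step only approximates what the library does internally.

```python
import re

# Hypothetical helper for illustration only; not part of licence-normaliser.
REQUIRED_FIELDS = ("version_key", "name_key", "family_key")
_WHITESPACE = re.compile(r"\s+")


def check_alias_entry(key: str, entry: dict) -> list[str]:
    """Return a list of problems found in one aliases.json entry."""
    problems = []

    # Keys must be lowercase and whitespace-collapsed.
    cleaned = _WHITESPACE.sub(" ", key.strip()).lower()
    if key != cleaned:
        problems.append(f"key {key!r} is not in cleaned form (expected {cleaned!r})")

    # Every entry needs the three hierarchy levels as non-empty strings.
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty field {field!r}")

    return problems


good = check_alias_entry(
    "my new alias",
    {"version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"},
)
bad = check_alias_entry("My  New Alias", {"version_key": "cc-by-4.0"})

print(good)  # no problems for a well-formed entry
for problem in bad:
    print(problem)
```

A check along these lines could be run over a data file before committing it, so that a mis-cased key or a missing ``family_key`` is caught before the test suite runs.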
src/licence_normaliser/data/aliases/aliases.json ================================================ src/licence_normaliser/data/aliases/aliases.json { "_comment": "Curated alias map: cleaned-lowercase-string -> metadata dict.", "_comment2": "Keys must already be in cleaned form (lowercase, whitespace-collapsed).", "aaas reuse": { "version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary", "aliases": [ "aaas author reuse", "aaas-author-reuse", "science author reuse" ] }, "acs authorchoice": { "version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa", "aliases": [ "acs-authorchoice" ] }, "acs-authorchoice-ccby": { "version_key": "acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa", "aliases": [ "acs authorchoice cc by" ] }, "acs-authorchoice-ccbyncnd": { "version_key": "acs-authorchoice-ccbyncnd", "name_key": "acs-authorchoice-ccbyncnd", "family_key": "publisher-oa" }, "acs-authorchoice-nih": { "version_key": "acs-authorchoice-nih", "name_key": "acs-authorchoice-nih", "family_key": "publisher-oa" }, "agpl-3": { "version_key": "agpl-3.0", "name_key": "agpl-3", "family_key": "copyleft", "aliases": [ "agpl-v3", "agpl 3", "agpl", "agpl v3", "agpl-3.0+" ] }, "aip-rights": { "version_key": "aip-rights", "name_key": "aip-rights", "family_key": "publisher-proprietary", "aliases": [ "aip permissions" ] }, "all rights reserved": { "version_key": "all-rights-reserved", "name_key": "all-rights-reserved", "family_key": "publisher-proprietary", "aliases": [ "all-rights-reserved" ] }, "apache 2.0": { "version_key": "apache-2.0", "name_key": "apache", "family_key": "osi", "aliases": [ "apache 2", "apache", "apache license", "apache license 2.0" ] }, "aps-default": { "version_key": "aps-default", "name_key": "aps-default", "family_key": "publisher-proprietary", "aliases": [ "aps default license" ] }, "aps-tdm": { "version_key": "aps-tdm", "name_key": "aps-tdm", 
"family_key": "publisher-tdm", "aliases": [ "aps text mining" ] }, "author manuscript": { "version_key": "author-manuscript", "name_key": "author-manuscript", "family_key": "publisher-oa", "aliases": [ "author-manuscript" ] }, "bmj-copyright": { "version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary" }, "bsd 2-clause": { "version_key": "bsd-2-clause", "name_key": "bsd-2-clause", "family_key": "osi", "aliases": [ "bsd 2 clause", "bsd-2-clause", "bsd-2" ] }, "bsd 3-clause": { "version_key": "bsd-3-clause", "name_key": "bsd-3-clause", "family_key": "osi", "aliases": [ "bsd 3 clause", "bsd-3-clause", "bsd-3", "bsd-3 license", "bsd", "bsd license" ], "justification": "BSD 3-Clause is sometimes called 'BSD', so we need to make sure that this doesn't get confused with the generic 'bsd' alias for the BSD-2-Clause license." }, "cc by": { "version_key": "cc-by", "name_key": "cc-by", "family_key": "cc", "aliases": [ "cc-by", "cc by", "creative commons attribution", "creative commons attribution license", "creative commons by" ] }, "cc by 1.0": { "version_key": "cc-by-1.0", "name_key": "cc-by", "family_key": "cc" }, "cc by 2.0": { "version_key": "cc-by-2.0", "name_key": "cc-by", "family_key": "cc" }, "cc by 2.5": { "version_key": "cc-by-2.5", "name_key": "cc-by", "family_key": "cc" }, "cc by 3.0": { "version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc", "aliases": [ "cc-by-3.0", "cc-by-3", "creative commons attribution 3.0" ] }, "cc by 4.0": { "version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc", "aliases": [ "cc-by-4.0", "cc by 4", "cc-by 4", "cc-by-4", "creative commons attribution 4.0", "creative commons attribution 4.0 international", "creative commons attribution 4.0 international license", "creative commons by 4.0" ] }, "cc by-nc": { "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc", "aliases": [ "cc-by-nc", "cc by nc", "cc-by nc", "creative commons attribution-noncommercial", 
"creative commons by-nc" ] }, "cc by-nc 3.0": { "version_key": "cc-by-nc-3.0", "name_key": "cc-by-nc", "family_key": "cc" }, "cc by-nc 4.0": { "version_key": "cc-by-nc-4.0", "name_key": "cc-by-nc", "family_key": "cc", "aliases": [ "cc-by-nc-4.0", "cc by nc 4", "cc-by nc 4", "cc by nc-4", "cc-by nc-4", "cc-by-nc 4", "creative commons attribution-noncommercial 4.0", "creative commons attribution-noncommercial 4.0 international", "creative commons attribution-noncommercial 4.0 international license", "creative commons by-nc 4.0" ] }, "cc by-nc-nd": { "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc", "aliases": [ "cc-by-nc-nd", "cc by nc-nd", "cc by nc nd", "cc-by nc-nd", "creative commons attribution-noncommercial-noderivatives", "creative commons by-nc-nd" ] }, "cc by-nc-nd 3.0": { "version_key": "cc-by-nc-nd-3.0", "name_key": "cc-by-nc-nd", "family_key": "cc" }, "cc by-nc-nd 3.0 igo": { "version_key": "cc-by-nc-nd-3.0-igo", "name_key": "cc-by-nc-nd", "family_key": "cc", "justification": "IGO is a jurisdiction tag not a rights modifier. Rights profile (Attribution + NonCommercial + NoDerivatives) is identical to base instrument. Enforcement differs (international arbitration vs domestic courts) but does not affect license type." 
}, "cc by-nc-nd 4.0": { "version_key": "cc-by-nc-nd-4.0", "name_key": "cc-by-nc-nd", "family_key": "cc", "aliases": [ "cc-by-nc-nd-4.0", "cc by nc-nd 4", "cc-by nc-nd 4", "cc by nc-nd-4", "cc-by nc-nd-4", "cc-by-nc-nd 4", "creative commons attribution-noncommercial-noderivatives 4.0", "creative commons attribution-noncommercial-noderivatives 4.0 international", "creative commons attribution-noncommercial-noderivatives 4.0 international license", "creative commons by-nc-nd 4.0" ] }, "cc by-nc-sa": { "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc", "aliases": [ "cc-by-nc-sa", "cc by nc-sa", "cc by nc sa", "cc-by nc-sa", "creative commons by-nc-sa" ] }, "cc by-nc-sa 3.0": { "version_key": "cc-by-nc-sa-3.0", "name_key": "cc-by-nc-sa", "family_key": "cc" }, "cc by-nc-sa 4.0": { "version_key": "cc-by-nc-sa-4.0", "name_key": "cc-by-nc-sa", "family_key": "cc", "aliases": [ "cc-by-nc-sa-4.0", "cc by nc-sa 4", "cc-by nc-sa 4", "cc-by-nc-sa 4", "cc by nc-sa-4", "cc-by nc-sa-4", "creative commons attribution-noncommercial-sharealike 4.0", "creative commons attribution-noncommercial-sharealike 4.0 international", "creative commons attribution-noncommercial-sharealike 4.0 international license", "creative commons by-nc-sa 4.0" ] }, "cc by-nd": { "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc", "aliases": [ "cc-by-nd", "cc by nd", "cc-by nd", "creative commons by-nd", "creative commons attribution-noderivatives" ] }, "cc by-nd 3.0": { "version_key": "cc-by-nd-3.0", "name_key": "cc-by-nd", "family_key": "cc" }, "cc by-nd 4.0": { "version_key": "cc-by-nd-4.0", "name_key": "cc-by-nd", "family_key": "cc", "aliases": [ "cc-by-nd-4.0", "cc by nd 4", "cc-by nd 4", "cc by nd-4", "cc-by nd-4", "cc-by-nd 4", "creative commons attribution-noderivatives 4.0", "creative commons attribution-noderivatives 4.0 international", "creative commons attribution-noderivatives 4.0 international license", "creative commons by-nd 4.0" ] }, "cc by-sa": { 
"version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc", "aliases": [ "cc-by-sa", "cc by sa", "cc-by sa", "creative commons attribution-sharealike", "creative commons by-sa" ] }, "cc by-sa 3.0": { "version_key": "cc-by-sa-3.0", "name_key": "cc-by-sa", "family_key": "cc" }, "cc by-sa 4.0": { "version_key": "cc-by-sa-4.0", "name_key": "cc-by-sa", "family_key": "cc", "aliases": [ "cc-by-sa-4.0", "cc by sa 4", "cc-by sa 4", "cc by sa-4", "cc-by sa-4", "cc-by-sa 4", "creative commons attribution-sharealike 4.0", "creative commons attribution-sharealike 4.0 international", "creative commons attribution-sharealike 4.0 international license", "creative commons by-sa 4.0" ] }, "cc-pdm 1.0": { "version_key": "cc-pdm-1.0", "name_key": "cc-pdm", "family_key": "public-domain", "aliases": [ "cc-pdm-1.0", "cc pdm 1.0", "cc pdm-1.0", "cc-pdm", "cc pdm", "creative commons public domain", "creative commons public domain mark 1.0", "creative commons public domain mark" ] }, "cc0 1.0": { "version_key": "cc0-1.0", "name_key": "cc0", "family_key": "cc0", "aliases": [ "cc0-1.0", "cc-zero 1.0", "cc zero 1.0", "creative commons zero 1.0", "cc0", "cc 0", "cc zero", "creative commons zero", "cc-zero" ] }, "cup-terms": { "version_key": "cup-terms", "name_key": "cup-terms", "family_key": "publisher-proprietary", "aliases": [ "cambridge terms" ] }, "degruyter-terms": { "version_key": "degruyter-terms", "name_key": "degruyter-terms", "family_key": "publisher-proprietary", "aliases": [ "de gruyter terms" ] }, "elsevier oa": { "version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa", "aliases": [ "elsevier-oa", "elsevier user license" ] }, "elsevier tdm": { "version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm", "aliases": [ "elsevier tdmu", "elsevier-tdm" ] }, "gpl-2": { "version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft", "aliases": [ "gpl-v2", "gpl 2", "gnu gpl v2", "gpl v2", "gpl-2.0+" ] }, 
"gpl-3": { "version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft", "aliases": [ "gpl-v3", "gpl v3 only", "gpl 3", "gnu gpl", "gnu gpl v3", "gpl", "gpl v3", "gpl-3.0+" ], "justification": "gnu gpl, gnu gpl v3, gpl, gpl v3, gpl-3, and gpl-3.0+ are all standard aliases for GPL-3.0." }, "implied oa": { "version_key": "implied-oa", "name_key": "implied-oa", "family_key": "publisher-oa", "aliases": [ "implied open access", "implied-oa" ] }, "iop-copyright": { "version_key": "iop-copyright", "name_key": "iop-copyright", "family_key": "publisher-proprietary" }, "iop-tdm": { "version_key": "iop-tdm", "name_key": "iop-tdm", "family_key": "publisher-tdm", "aliases": [ "iop text and data mining" ] }, "isc license": { "version_key": "isc", "name_key": "isc", "family_key": "osi" }, "jama-cc-by": { "version_key": "jama-cc-by", "name_key": "jama-cc-by", "family_key": "publisher-oa", "aliases": [ "jama open access" ] }, "lgpl": { "version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft" }, "lgpl v2.1": { "version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft" }, "lgpl v3": { "version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft" }, "lgpl-2": { "version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft", "aliases": [ "lgpl-v2", "lgpl 2", "lgpl-2.1-only", "lgpl-2.1-or-later" ] }, "lgpl-2.1+": { "version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft" }, "lgpl-3": { "version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft", "aliases": [ "lgpl-v3", "lgpl 3" ] }, "lgpl-3.0+": { "version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft" }, "mit license": { "version_key": "mit", "name_key": "mit", "family_key": "osi", "aliases": [ "the mit license" ] }, "mozilla public license 2.0": { "version_key": "mpl-2.0", "name_key": "mpl", "family_key": "osi", "aliases": [ "mpl", "mpl-2.0", "mpl 2.0", "mozilla license", "mozilla public license", "mozilla" ] }, "no 
reuse": { "version_key": "no-reuse", "name_key": "no-reuse", "family_key": "publisher-proprietary", "aliases": [ "no-reuse" ] }, "odbl": { "version_key": "odbl", "name_key": "odbl", "family_key": "open-data", "aliases": [ "open database license" ] }, "odc-by": { "version_key": "odc-by", "name_key": "odc-by", "family_key": "open-data" }, "other-oa": { "version_key": "other-oa", "name_key": "other-oa", "family_key": "other-oa", "aliases": [ "open access", "open-access" ] }, "oup-chorus": { "version_key": "oup-chorus", "name_key": "oup-chorus", "family_key": "publisher-oa" }, "oup-terms": { "version_key": "oup-terms", "name_key": "oup-terms", "family_key": "publisher-proprietary", "aliases": [ "oup standard publication" ] }, "pd": { "version_key": "public-domain", "name_key": "public-domain", "family_key": "public-domain", "aliases": [ "public domain", "public-domain" ] }, "pddl": { "version_key": "pddl", "name_key": "pddl", "family_key": "open-data" }, "pnas terms": { "version_key": "pnas-licenses", "name_key": "pnas-licenses", "family_key": "publisher-proprietary", "aliases": [ "pnas-licenses" ] }, "rsc-terms": { "version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary" }, "sage-permissions": { "version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary" }, "springer tdm": { "version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm", "aliases": [ "springer-tdm" ] }, "springernature-tdm": { "version_key": "springernature-tdm", "name_key": "springernature-tdm", "family_key": "publisher-tdm", "aliases": [ "springer nature tdm", "springer nature text and data mining" ] }, "tandf-terms": { "version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary", "aliases": [ "taylor and francis terms", "taylor francis terms" ] }, "thieme nlm": { "version_key": "thieme-nlm", "name_key": "thieme-nlm", "family_key": "publisher-oa", "aliases": [ 
"thieme-nlm" ] }, "unlicense": { "version_key": "unlicense", "name_key": "unlicense", "family_key": "osi" }, "unspecified oa": { "version_key": "unspecified-oa", "name_key": "unspecified-oa", "family_key": "other-oa", "aliases": [ "unspecified-oa" ] }, "wiley terms": { "version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary", "aliases": [ "wiley-terms" ] }, "wiley-am": { "version_key": "wiley-am", "name_key": "wiley-am", "family_key": "publisher-proprietary", "aliases": [ "wiley author manuscript" ] }, "wiley-tdm": { "version_key": "wiley-tdm", "name_key": "wiley-tdm", "family_key": "publisher-tdm", "aliases": [ "wiley tdm license" ] }, "wiley-vor": { "version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary" }, "wtfpl": { "version_key": "wtfpl", "name_key": "wtfpl", "family_key": "osi" }, "zlib": { "version_key": "zlib", "name_key": "zlib", "family_key": "osi" }, "© the author(s)": { "version_key": "publisher-specific-oa", "name_key": "publisher-specific-oa", "family_key": "publisher-oa", "aliases": [ "publisher specific oa", "publisher-specific-oa" ] } } src/licence_normaliser/data/prose/prose_patterns.json ===================================================== src/licence_normaliser/data/prose/prose_patterns.json [ {"pattern": "cc\\s*by-nc-nd\\s*4\\.0", "version_key": "cc-by-nc-nd-4.0", "name_key": "cc-by-nc-nd", "family_key": "cc"}, {"pattern": "cc\\s*by-nc-nd\\s*3\\.0", "version_key": "cc-by-nc-nd-3.0", "name_key": "cc-by-nc-nd", "family_key": "cc"}, {"pattern": "cc\\s*by-nc-nd", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"}, {"pattern": "cc\\s*by-nc-sa\\s*4\\.0", "version_key": "cc-by-nc-sa-4.0", "name_key": "cc-by-nc-sa", "family_key": "cc"}, {"pattern": "cc\\s*by-nc-sa\\s*3\\.0", "version_key": "cc-by-nc-sa-3.0", "name_key": "cc-by-nc-sa", "family_key": "cc"}, {"pattern": "creative\\s+commons\\s+by-nc-sa", "version_key": "cc-by-nc-sa", "name_key": 
"cc-by-nc-sa", "family_key": "cc"}, {"pattern": "creative\\s+commons\\s+by-nc-nd", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"}, {"pattern": "creative\\s+commons\\s+by-nc", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"}, {"pattern": "creative\\s+commons\\s+by-nd", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"}, {"pattern": "creative\\s+commons\\s+by-sa", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"}, {"pattern": "creative\\s+commons\\s+by", "version_key": "cc-by", "name_key": "cc-by", "family_key": "cc"}, {"pattern": "cc\\s*by-nc-sa", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"}, {"pattern": "cc\\s*by-nc\\s*4\\.0", "version_key": "cc-by-nc-4.0", "name_key": "cc-by-nc", "family_key": "cc"}, {"pattern": "cc\\s*by-nc\\s*3\\.0", "version_key": "cc-by-nc-3.0", "name_key": "cc-by-nc", "family_key": "cc"}, {"pattern": "cc\\s*by-nc", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"}, {"pattern": "cc\\s*by-sa\\s*4\\.0", "version_key": "cc-by-sa-4.0", "name_key": "cc-by-sa", "family_key": "cc"}, {"pattern": "cc\\s*by-sa\\s*3\\.0", "version_key": "cc-by-sa-3.0", "name_key": "cc-by-sa", "family_key": "cc"}, {"pattern": "cc\\s*by-sa", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"}, {"pattern": "cc\\s*by-nd\\s*4\\.0", "version_key": "cc-by-nd-4.0", "name_key": "cc-by-nd", "family_key": "cc"}, {"pattern": "cc\\s*by-nd\\s*3\\.0", "version_key": "cc-by-nd-3.0", "name_key": "cc-by-nd", "family_key": "cc"}, {"pattern": "cc\\s*by-nd", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"}, {"pattern": "cc\\s*by\\s*4\\.0", "version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"}, {"pattern": "cc\\s*by\\s*3\\.0", "version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"}, {"pattern": "cc\\s*by\\s*2\\.0", "version_key": "cc-by-2.0", "name_key": "cc-by", "family_key": 
"cc"}, {"pattern": "\\bcc\\s*by\\b(?!\\s*-)", "version_key": "cc-by", "name_key": "cc-by", "family_key": "cc"}, {"pattern": "\\bcc\\s*0\\b|cc\\s*zero", "version_key": "cc0", "name_key": "cc0", "family_key": "cc0"}, {"pattern": "attribution.{0,30}non.?commercial.{0,30}no.?deriv", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"}, {"pattern": "attribution.{0,30}non.?commercial.{0,30}share.?alike", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"}, {"pattern": "attribution.{0,30}non.?commercial", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"}, {"pattern": "attribution.{0,30}no.?deriv", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"}, {"pattern": "attribution.{0,30}share.?alike", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"}, {"pattern": "elsevier.*tdm|tdm.*elsevier", "version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"}, {"pattern": "elsevier.*user\\s*licen", "version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"}, {"pattern": "wiley.*tdm|tdm.*wiley", "version_key": "wiley-tdm", "name_key": "wiley-tdm", "family_key": "publisher-tdm"}, {"pattern": "springer.*tdm|tdm.*springer", "version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"}, {"pattern": "acs\\s*authorchoice.*cc\\s*by(?!-nc)", "version_key": "acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa"}, {"pattern": "acs\\s*authorchoice", "version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"}, {"pattern": "all\\s*rights\\s*reserved", "version_key": "all-rights-reserved", "name_key": "all-rights-reserved", "family_key": "publisher-proprietary"}, {"pattern": "author\\s*manuscript", "version_key": "author-manuscript", "name_key": "author-manuscript", "family_key": "publisher-oa"}, {"pattern": "public\\s*domain", 
"version_key": "public-domain", "name_key": "public-domain", "family_key": "public-domain"}, {"pattern": "open\\s*access", "version_key": "other-oa", "name_key": "other-oa", "family_key": "other-oa"} ] src/licence_normaliser/data/publishers/publishers.json ====================================================== src/licence_normaliser/data/publishers/publishers.json { "_comment": "Publisher-specific license URLs and shorthand aliases.", "_comment2": "URLs: normalized to https with no trailing slash on lookup.", "_comment3": "Aliases: cleaned-lowercase form -> version_key.", "urls": { "https://www.elsevier.com/open-access/userlicense/1.0/": { "version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa" }, "http://www.elsevier.com/open-access/userlicense/1.0/": { "version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa" }, "https://www.elsevier.com/tdm/userlicense/1.0/": { "version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm" }, "http://www.elsevier.com/tdm/userlicense/1.0/": { "version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm" }, "http://doi.wiley.com/10.1002/tdm_license_1": { "version_key": "wiley-tdm", "name_key": "wiley-tdm", "family_key": "publisher-tdm" }, "http://doi.wiley.com/10.1002/tdm_license_1.1": { "version_key": "wiley-tdm-1.1", "name_key": "wiley-tdm", "family_key": "publisher-tdm" }, "http://onlinelibrary.wiley.com/termsAndConditions#vor": { "version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary" }, "http://onlinelibrary.wiley.com/termsAndConditions#am": { "version_key": "wiley-am", "name_key": "wiley-am", "family_key": "publisher-proprietary" }, "https://onlinelibrary.wiley.com/termsandconditions#vor": { "version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary" }, "https://onlinelibrary.wiley.com/termsandconditions#am": { "version_key": "wiley-am", "name_key": 
"wiley-am", "family_key": "publisher-proprietary" }, "https://onlinelibrary.wiley.com/termsandconditions": { "version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary" }, "https://onlinelibrary.wiley.com/terms-and-conditions": { "version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary" }, "https://www.springer.com/tdm": { "version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm" }, "http://www.springer.com/tdm": { "version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm" }, "https://www.springernature.com/gp/researchers/text-and-data-mining": { "version_key": "springernature-tdm", "name_key": "springernature-tdm", "family_key": "publisher-tdm" }, "https://www.tandfonline.com/action/showCopyRight": { "version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary" }, "https://www.tandfonline.com/action/showcopyright": { "version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary" }, "https://tandfonline.com/action/showcopyright": { "version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary" }, "https://www.tandfonline.com/action/showcopyright?show=full": { "version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary" }, "https://us.sagepub.com/en-us/nam/journals-permissions": { "version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary" }, "https://www.sagepub.com/journalspermissions.nav": { "version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary" }, "https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": { "version_key": "acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa" }, "http://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": { "version_key": 
"acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa" }, "https://pubs.acs.org/page/policy/authorchoice_ccbyncnd_termsofuse.html": { "version_key": "acs-authorchoice-ccbyncnd", "name_key": "acs-authorchoice-ccbyncnd", "family_key": "publisher-oa" }, "https://pubs.acs.org/page/policy/authorchoice_termsofuse.html": { "version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa" }, "https://pubs.acs.org/page/policy/acs_authorchoice_with_nih_addendum_termsofuse.html": { "version_key": "acs-authorchoice-nih", "name_key": "acs-authorchoice-nih", "family_key": "publisher-oa" }, "https://doi.org/10.1021/policy/oa-license": { "version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa" }, "https://www.rsc.org/journals-books-databases/journal-authors-reviewers/licences-copyright-permissions/": { "version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary" }, "https://www.rsc.org/help/disclaimer/pages/term3.aspx": { "version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary" }, "https://iopscience.iop.org/info/page/text-and-data-mining": { "version_key": "iop-tdm", "name_key": "iop-tdm", "family_key": "publisher-tdm" }, "http://iopscience.iop.org/info/page/text-and-data-mining": { "version_key": "iop-tdm", "name_key": "iop-tdm", "family_key": "publisher-tdm" }, "https://iopscience.iop.org/page/copyright": { "version_key": "iop-copyright", "name_key": "iop-copyright", "family_key": "publisher-proprietary" }, "https://www.bmj.com/company/legal-stuff/copyright-notice/": { "version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary" }, "https://group.bmj.com/group/rights-licensing/permissions": { "version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary" }, "https://www.science.org/content/page/science-licenses-journal-article-reuse": { 
"version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary" }, "https://www.sciencemag.org/about/science-licenses-journal-article-reuse": { "version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary" }, "https://www.pnas.org/site/aboutpnas/licenses.xhtml": { "version_key": "pnas-licenses", "name_key": "pnas-licenses", "family_key": "publisher-proprietary" }, "https://link.aps.org/licenses/aps-default-license": { "version_key": "aps-default", "name_key": "aps-default", "family_key": "publisher-proprietary" }, "https://link.aps.org/licenses/aps-default-text-mining-license": { "version_key": "aps-tdm", "name_key": "aps-tdm", "family_key": "publisher-tdm" }, "https://www.cambridge.org/core/terms": { "version_key": "cup-terms", "name_key": "cup-terms", "family_key": "publisher-proprietary" }, "https://publishing.aip.org/authors/rights-and-permissions": { "version_key": "aip-rights", "name_key": "aip-rights", "family_key": "publisher-proprietary" }, "http://publishing.aip.org/authors/rights-and-permissions": { "version_key": "aip-rights", "name_key": "aip-rights", "family_key": "publisher-proprietary" }, "https://jamanetwork.com/pages/cc-by-license-permissions": { "version_key": "jama-cc-by", "name_key": "jama-cc-by", "family_key": "publisher-oa" }, "https://www.degruyter.com/dg/page/496": { "version_key": "degruyter-terms", "name_key": "degruyter-terms", "family_key": "publisher-proprietary" }, "https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model": { "version_key": "oup-chorus", "name_key": "oup-chorus", "family_key": "publisher-oa" }, "https://academic.oup.com/pages/standard-publication-reuse-rights": { "version_key": "oup-terms", "name_key": "oup-terms", "family_key": "publisher-proprietary" }, "https://www.gnu.org/licenses/gpl-2.0.html": { "version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft" }, 
"https://www.gnu.org/licenses/gpl-3.0.html": { "version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft" }, "https://www.gnu.org/licenses/agpl-3.0.html": { "version_key": "agpl-3.0", "name_key": "agpl-3", "family_key": "copyleft" }, "https://www.gnu.org/licenses/lgpl-2.1.html": { "version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft" }, "https://www.gnu.org/licenses/lgpl-3.0.html": { "version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft" }, "https://opendatacommons.org/licenses/odbl/1-0/": { "version_key": "odbl", "name_key": "odbl", "family_key": "open-data" }, "https://opendatacommons.org/licenses/by/1-0/": { "version_key": "odc-by", "name_key": "odc-by", "family_key": "open-data" }, "https://opendatacommons.org/licenses/pddl/1-0/": { "version_key": "pddl", "name_key": "pddl", "family_key": "open-data" } }, "shorthand_aliases": { "elsevier user license": "elsevier-oa", "elsevier tdm": "elsevier-tdm", "elsevier tdmu": "elsevier-tdm", "wiley tdm license": "wiley-tdm", "wiley tdm": "wiley-tdm", "wiley vor": "wiley-vor", "wiley am": "wiley-am", "wiley author manuscript": "wiley-am", "springer tdm": "springer-tdm", "springer nature tdm": "springernature-tdm", "springer nature text and data mining": "springernature-tdm", "tandf terms": "tandf-terms", "taylor and francis terms": "tandf-terms", "taylor francis terms": "tandf-terms", "sage permissions": "sage-permissions", "acs authorchoice": "acs-authorchoice", "acs author choice": "acs-authorchoice", "acs authorchoice cc by": "acs-authorchoice-ccby", "acs authorchoice cc by nc nd": "acs-authorchoice-ccbyncnd", "acs authorchoice nih": "acs-authorchoice-nih", "rsc terms": "rsc-terms", "rsc copyright": "rsc-terms", "iop tdm": "iop-tdm", "iop text and data mining": "iop-tdm", "iop copyright": "iop-copyright", "bmj copyright": "bmj-copyright", "bmj permissions": "bmj-copyright", "aaas author reuse": "aaas-author-reuse", "aaas reuse": "aaas-author-reuse", "science author 
reuse": "aaas-author-reuse", "pnas licenses": "pnas-licenses", "pnas terms": "pnas-licenses", "aps default": "aps-default", "aps tdm": "aps-tdm", "aps text mining": "aps-tdm", "aps default license": "aps-default", "cambridge terms": "cup-terms", "cup terms": "cup-terms", "aip rights": "aip-rights", "aip permissions": "aip-rights", "jama cc by": "jama-cc-by", "jama open access": "jama-cc-by", "degruyter terms": "degruyter-terms", "de gruyter terms": "degruyter-terms", "oup chorus": "oup-chorus", "oup terms": "oup-terms", "oup standard publication": "oup-terms", "thieme nlm": "thieme-nlm", "implied oa": "implied-oa", "implied open access": "implied-oa", "unspecified oa": "unspecified-oa", "publisher specific oa": "publisher-specific-oa", "author manuscript": "author-manuscript", "all rights reserved": "all-rights-reserved", "no reuse": "no-reuse", "public domain": "public-domain", "open access": "other-oa", "creative commons public domain": "cc-pdm-1.0", "pd": "public-domain" } } src/licence_normaliser/data/urls/url_map.json ============================================= src/licence_normaliser/data/urls/url_map.json { "_comment": "URL -> metadata dict. 
Both http and https variants may be listed.", "_comment2": "Normalisation (https, no trailing slash) is applied on load.", "https://creativecommons.org/licenses/by/4.0/": {"version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"}, "https://creativecommons.org/licenses/by/3.0/": {"version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"}, "https://creativecommons.org/licenses/by/2.5/": {"version_key": "cc-by-2.5", "name_key": "cc-by", "family_key": "cc"}, "https://creativecommons.org/licenses/by/2.0/": {"version_key": "cc-by-2.0", "name_key": "cc-by", "family_key": "cc"}, "https://creativecommons.org/licenses/by/1.0/": {"version_key": "cc-by-1.0", "name_key": "cc-by", "family_key": "cc"}, "https://creativecommons.org/licenses/by/3.0/deed.en_us": {"version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"}, "https://creativecommons.org/licenses/by-sa/4.0/": {"version_key": "cc-by-sa-4.0", "name_key": "cc-by-sa", "family_key": "cc"}, "https://creativecommons.org/licenses/by-sa/3.0/": {"version_key": "cc-by-sa-3.0", "name_key": "cc-by-sa", "family_key": "cc"}, "https://creativecommons.org/licenses/by-sa/2.5/": {"version_key": "cc-by-sa-2.5", "name_key": "cc-by-sa", "family_key": "cc"}, "https://creativecommons.org/licenses/by-sa/2.0/": {"version_key": "cc-by-sa-2.0", "name_key": "cc-by-sa", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nd/4.0/": {"version_key": "cc-by-nd-4.0", "name_key": "cc-by-nd", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nd/3.0/": {"version_key": "cc-by-nd-3.0", "name_key": "cc-by-nd", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nd/2.0/": {"version_key": "cc-by-nd-2.0", "name_key": "cc-by-nd", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc/4.0/": {"version_key": "cc-by-nc-4.0", "name_key": "cc-by-nc", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc/3.0/": {"version_key": "cc-by-nc-3.0", "name_key": "cc-by-nc", "family_key": "cc"}, 
"https://creativecommons.org/licenses/by-nc/2.5/": {"version_key": "cc-by-nc-2.5", "name_key": "cc-by-nc", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc/2.0/": {"version_key": "cc-by-nc-2.0", "name_key": "cc-by-nc", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc-sa/4.0/": {"version_key": "cc-by-nc-sa-4.0", "name_key": "cc-by-nc-sa", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc-sa/3.0/": {"version_key": "cc-by-nc-sa-3.0", "name_key": "cc-by-nc-sa", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc-sa/2.5/": {"version_key": "cc-by-nc-sa-2.5", "name_key": "cc-by-nc-sa", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc-sa/2.0/": {"version_key": "cc-by-nc-sa-2.0", "name_key": "cc-by-nc-sa", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc-nd/4.0/": {"version_key": "cc-by-nc-nd-4.0", "name_key": "cc-by-nc-nd", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc-nd/3.0/": {"version_key": "cc-by-nc-nd-3.0", "name_key": "cc-by-nc-nd", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc-nd/2.5/": {"version_key": "cc-by-nc-nd-2.5", "name_key": "cc-by-nc-nd", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc-nd/2.0/": {"version_key": "cc-by-nc-nd-2.0", "name_key": "cc-by-nc-nd", "family_key": "cc"}, "https://creativecommons.org/licenses/by/3.0/igo/": {"version_key": "cc-by-3.0-igo", "name_key": "cc-by-igo", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc-sa/3.0/igo/": {"version_key": "cc-by-nc-sa-3.0-igo", "name_key": "cc-by-nc-sa-igo", "family_key": "cc"}, "https://creativecommons.org/licenses/by-nc-nd/3.0/igo/": {"version_key": "cc-by-nc-nd-3.0-igo", "name_key": "cc-by-nc-nd-igo", "family_key": "cc"}, "https://creativecommons.org/publicdomain/zero/1.0/": {"version_key": "cc0", "name_key": "cc0", "family_key": "cc0"}, "https://creativecommons.org/publicdomain/mark/1.0/": {"version_key": "cc-pdm", "name_key": 
"cc-pdm", "family_key": "public-domain"}, "https://www.gnu.org/licenses/gpl-2.0.html": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"}, "https://www.gnu.org/licenses/gpl-2.0": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"}, "http://www.gnu.org/licenses/gpl-2.0.html": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"}, "https://www.gnu.org/licenses/gpl-3.0.html": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"}, "https://www.gnu.org/licenses/gpl-3.0": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"}, "http://www.gnu.org/licenses/gpl-3.0.html": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"}, "https://www.gnu.org/licenses/agpl-3.0.html": {"version_key": "agpl-3.0", "name_key": "agpl-3", "family_key": "copyleft"}, "https://www.gnu.org/licenses/agpl-3.0": {"version_key": "agpl-3.0", "name_key": "agpl-3", "family_key": "copyleft"}, "https://www.gnu.org/licenses/lgpl-2.1.html": {"version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft"}, "https://www.gnu.org/licenses/lgpl-2.1": {"version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft"}, "https://www.gnu.org/licenses/lgpl-3.0.html": {"version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft"}, "https://www.gnu.org/licenses/lgpl-3.0": {"version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft"}, "https://opensource.org/licenses/MIT": {"version_key": "mit", "name_key": "mit", "family_key": "osi"}, "https://www.apache.org/licenses/LICENSE-2.0": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"}, "https://www.apache.org/licenses/LICENSE-2.0.html": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"}, "https://opensource.org/licenses/Apache-2.0": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"}, "https://opensource.org/licenses/BSD-2-Clause": 
{"version_key": "bsd-2-clause", "name_key": "bsd-2-clause", "family_key": "osi"}, "https://opensource.org/licenses/BSD-3-Clause": {"version_key": "bsd-3-clause", "name_key": "bsd-3-clause", "family_key": "osi"}, "https://opensource.org/licenses/ISC": {"version_key": "isc", "name_key": "isc", "family_key": "osi"}, "https://www.mozilla.org/en-US/MPL/2.0/": {"version_key": "mpl-2.0", "name_key": "mpl", "family_key": "osi"}, "https://www.mozilla.org/MPL/2.0/": {"version_key": "mpl-2.0", "name_key": "mpl", "family_key": "osi"}, "https://www.elsevier.com/open-access/userlicense/1.0/": {"version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"}, "http://www.elsevier.com/open-access/userlicense/1.0/": {"version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"}, "https://www.elsevier.com/tdm/userlicense/1.0/": {"version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"}, "http://www.elsevier.com/tdm/userlicense/1.0/": {"version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"}, "http://doi.wiley.com/10.1002/tdm_license_1": {"version_key": "wiley-tdm", "name_key": "wiley-tdm", "family_key": "publisher-tdm"}, "http://doi.wiley.com/10.1002/tdm_license_1.1": {"version_key": "wiley-tdm-1.1", "name_key": "wiley-tdm", "family_key": "publisher-tdm"}, "http://onlinelibrary.wiley.com/termsAndConditions#vor": {"version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary"}, "http://onlinelibrary.wiley.com/termsAndConditions#am": {"version_key": "wiley-am", "name_key": "wiley-am", "family_key": "publisher-proprietary"}, "https://onlinelibrary.wiley.com/termsandconditions#vor": {"version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary"}, "https://onlinelibrary.wiley.com/termsandconditions#am": {"version_key": "wiley-am", "name_key": "wiley-am", "family_key": "publisher-proprietary"}, 
"https://onlinelibrary.wiley.com/termsandconditions": {"version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary"}, "https://onlinelibrary.wiley.com/terms-and-conditions": {"version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary"}, "https://www.springer.com/tdm": {"version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"}, "http://www.springer.com/tdm": {"version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"}, "https://www.springernature.com/gp/researchers/text-and-data-mining": {"version_key": "springernature-tdm", "name_key": "springernature-tdm", "family_key": "publisher-tdm"}, "https://www.tandfonline.com/action/showCopyRight": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"}, "https://www.tandfonline.com/action/showcopyright": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"}, "https://tandfonline.com/action/showcopyright": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"}, "https://www.tandfonline.com/action/showcopyright?show=full": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"}, "https://us.sagepub.com/en-us/nam/journals-permissions": {"version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary"}, "https://www.sagepub.com/journalspermissions.nav": {"version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary"}, "https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": {"version_key": "acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa"}, "https://pubs.acs.org/page/policy/authorchoice_ccbyncnd_termsofuse.html": {"version_key": "acs-authorchoice-ccbyncnd", "name_key": "acs-authorchoice-ccbyncnd", 
"family_key": "publisher-oa"}, "https://pubs.acs.org/page/policy/authorchoice_termsofuse.html": {"version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"}, "https://pubs.acs.org/page/policy/acs_authorchoice_with_nih_addendum_termsofuse.html": {"version_key": "acs-authorchoice-nih", "name_key": "acs-authorchoice-nih", "family_key": "publisher-oa"}, "https://doi.org/10.1021/policy/oa-license": {"version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"}, "https://www.rsc.org/journals-books-databases/journal-authors-reviewers/licences-copyright-permissions/": {"version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary"}, "https://www.rsc.org/help/disclaimer/pages/term3.aspx": {"version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary"}, "https://iopscience.iop.org/info/page/text-and-data-mining": {"version_key": "iop-tdm", "name_key": "iop-tdm", "family_key": "publisher-tdm"}, "https://iopscience.iop.org/page/copyright": {"version_key": "iop-copyright", "name_key": "iop-copyright", "family_key": "publisher-proprietary"}, "https://www.bmj.com/company/legal-stuff/copyright-notice/": {"version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary"}, "https://group.bmj.com/group/rights-licensing/permissions": {"version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary"}, "https://www.science.org/content/page/science-licenses-journal-article-reuse": {"version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary"}, "https://www.sciencemag.org/about/science-licenses-journal-article-reuse": {"version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary"}, "https://www.pnas.org/site/aboutpnas/licenses.xhtml": {"version_key": "pnas-licenses", "name_key": "pnas-licenses", 
"family_key": "publisher-proprietary"}, "https://link.aps.org/licenses/aps-default-license": {"version_key": "aps-default", "name_key": "aps-default", "family_key": "publisher-proprietary"}, "https://link.aps.org/licenses/aps-default-text-mining-license": {"version_key": "aps-tdm", "name_key": "aps-tdm", "family_key": "publisher-tdm"}, "https://www.cambridge.org/core/terms": {"version_key": "cup-terms", "name_key": "cup-terms", "family_key": "publisher-proprietary"}, "https://publishing.aip.org/authors/rights-and-permissions": {"version_key": "aip-rights", "name_key": "aip-rights", "family_key": "publisher-proprietary"}, "https://jamanetwork.com/pages/cc-by-license-permissions": {"version_key": "jama-cc-by", "name_key": "jama-cc-by", "family_key": "publisher-oa"}, "https://www.degruyter.com/dg/page/496": {"version_key": "degruyter-terms", "name_key": "degruyter-terms", "family_key": "publisher-proprietary"}, "https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model": {"version_key": "oup-chorus", "name_key": "oup-chorus", "family_key": "publisher-oa"}, "https://academic.oup.com/pages/standard-publication-reuse-rights": {"version_key": "oup-terms", "name_key": "oup-terms", "family_key": "publisher-proprietary"}, "https://opendatacommons.org/licenses/odbl/1-0/": {"version_key": "odbl", "name_key": "odbl", "family_key": "open-data"}, "https://opendatacommons.org/licenses/by/1-0/": {"version_key": "odc-by", "name_key": "odc-by", "family_key": "open-data"}, "https://opendatacommons.org/licenses/pddl/1-0/": {"version_key": "pddl", "name_key": "pddl", "family_key": "open-data"} } src/licence_normaliser/defaults.py ================================== src/licence_normaliser/defaults.py """Default plugin configuration. These are the plugin CLASSES (not instances) that form the sane defaults. Pass them to LicenseNormaliser - they're instantiated lazily. 
""" from __future__ import annotations from collections.abc import Mapping from typing import Iterator __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ( "DEFAULT_PLUGINS", "DEFAULT_PLUGIN_KEYS", "get_all_refreshable_plugins", ) DEFAULT_PLUGIN_KEYS = ("registry", "url", "alias", "family", "name", "prose") def get_all_refreshable_plugins() -> list[type]: """Return all plugin classes that support refresh (have url set).""" from .parsers.creativecommons import CreativeCommonsParser from .parsers.opendefinition import OpenDefinitionParser from .parsers.osi import OSIParser from .parsers.scancode_licensedb import ScanCodeLicenseDBParser from .parsers.spdx import SPDXParser return [ SPDXParser, OpenDefinitionParser, OSIParser, ScanCodeLicenseDBParser, CreativeCommonsParser, ] def _load_registry_plugins() -> list[type]: from .parsers.creativecommons import CreativeCommonsParser from .parsers.opendefinition import OpenDefinitionParser from .parsers.osi import OSIParser from .parsers.scancode_licensedb import ScanCodeLicenseDBParser from .parsers.spdx import SPDXParser return [ SPDXParser, OpenDefinitionParser, OSIParser, ScanCodeLicenseDBParser, CreativeCommonsParser, ] def _load_url_plugins() -> list[type]: from .parsers.creativecommons import CreativeCommonsParser from .parsers.opendefinition import OpenDefinitionParser from .parsers.osi import OSIParser from .parsers.publisher import PublisherParser from .parsers.spdx import SPDXParser return [ SPDXParser, OpenDefinitionParser, OSIParser, CreativeCommonsParser, PublisherParser, ] def _load_alias_plugins() -> list[type]: from .parsers.alias import AliasParser from .parsers.publisher import PublisherParser # PublisherParser first, then AliasParser - AliasParser values take precedence return [PublisherParser, AliasParser] def _load_family_plugins() -> list[type]: from .parsers.alias import AliasParser return [AliasParser] def _load_name_plugins() -> list[type]: from 
.parsers.alias import AliasParser return [AliasParser] def _load_prose_plugins() -> list[type]: from .parsers.prose import ProseParser return [ProseParser] # Lazy-loaded bundle - functions delay imports until actually needed class _LazyDefaults: """Lazy-loading container for default plugins.""" _registry: list[type] | None = None _url: list[type] | None = None _alias: list[type] | None = None _family: list[type] | None = None _name: list[type] | None = None _prose: list[type] | None = None @property def registry(self) -> list[type]: if self._registry is None: self._registry = _load_registry_plugins() return self._registry @property def url(self) -> list[type]: if self._url is None: self._url = _load_url_plugins() return self._url @property def alias(self) -> list[type]: if self._alias is None: self._alias = _load_alias_plugins() return self._alias @property def family(self) -> list[type]: if self._family is None: self._family = _load_family_plugins() return self._family @property def name(self) -> list[type]: if self._name is None: self._name = _load_name_plugins() return self._name @property def prose(self) -> list[type]: if self._prose is None: self._prose = _load_prose_plugins() return self._prose _LAZY = _LazyDefaults() # Convenience accessors - these trigger lazy loading def get_default_registry() -> list[type]: return _LAZY.registry def get_default_url() -> list[type]: return _LAZY.url def get_default_alias() -> list[type]: return _LAZY.alias def get_default_family() -> list[type]: return _LAZY.family def get_default_name() -> list[type]: return _LAZY.name def get_default_prose() -> list[type]: return _LAZY.prose class _LazyPluginsBundle: """Lazy-loading bundle - defers plugin loading until accessed.""" _cache: dict[str, list[type]] = {} def _get_registry(self) -> list[type]: return get_default_registry() def _get_url(self) -> list[type]: return get_default_url() def _get_alias(self) -> list[type]: return get_default_alias() def _get_family(self) -> 
list[type]: return get_default_family() def _get_name(self) -> list[type]: return get_default_name() def _get_prose(self) -> list[type]: return get_default_prose() def __getitem__(self, key: str) -> list[type]: if key not in self._cache: fn = getattr(self, f"_get_{key}", None) if fn is None: raise KeyError(key) self._cache[key] = fn() return self._cache[key] _DEFAULT_PLUGINS_BUNDLE = _LazyPluginsBundle() class _DefaultPlugins(Mapping): """Lazy dict-like accessor for default plugins.""" def __getitem__(self, key: str) -> list[type]: return _DEFAULT_PLUGINS_BUNDLE[key] def keys(self) -> tuple[str, ...]: return DEFAULT_PLUGIN_KEYS def values(self) -> list[list[type]]: return [self[k] for k in self.keys()] def items(self) -> list[tuple[str, list[type]]]: return [(k, self[k]) for k in self.keys()] def __iter__(self) -> Iterator[str]: return iter(self.keys()) def __len__(self) -> int: return len(DEFAULT_PLUGIN_KEYS) def __contains__(self, key: str) -> bool: return key in self.keys() def copy(self) -> dict: return dict(self.items()) DEFAULT_PLUGINS = _DefaultPlugins() src/licence_normaliser/exceptions.py ==================================== src/licence_normaliser/exceptions.py """licence_normaliser.exceptions - public exception hierarchy. These are the only exceptions that cross the public API boundary. All internal errors are wrapped before propagation. """ from __future__ import annotations __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ( "DataSourceError", "LicenseNormalisationError", "LicenseNormaliserError", "LicenseNotFoundError", ) class LicenseNormaliserError(Exception): """Base exception for all licence-normaliser errors.""" class LicenseNotFoundError(LicenseNormaliserError): """Raised in strict mode when a license string cannot be resolved.""" def __init__(self, raw: str, cleaned: str) -> None: self.raw = raw self.cleaned = cleaned super().__init__( f"License not found: {raw!r} (cleaned: {cleaned!r}).
" "Pass strict=False to return an 'unknown' result instead." ) class DataSourceError(LicenseNormaliserError): """Raised when a data source file cannot be loaded or parsed.""" class LicenseNormalisationError(ValueError): """Raised when ``strict=True`` and no canonical license could be resolved.""" src/licence_normaliser/parsers/__init__.py ========================================== src/licence_normaliser/parsers/__init__.py src/licence_normaliser/parsers/alias.py ======================================= src/licence_normaliser/parsers/alias.py """Alias parser - loads aliases.json with rich metadata for aliases/family overrides. Each entry may carry an optional ``aliases`` list of extra lookup keys that all resolve to the same ``version_key``. This lets data authors enumerate explicit variants (e.g. hyphen vs space forms) without any auto-generation magic:: "cc by-nc": { "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc", "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"] } All keys in ``aliases`` inherit the same ``version_key``, ``name_key``, and ``family_key`` as the primary entry. """ from __future__ import annotations import json from pathlib import Path from typing import Any from licence_normaliser.plugins import AliasPlugin, BasePlugin, FamilyPlugin, NamePlugin __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ("AliasParser",) def _iter_entries( data: dict[str, Any], ) -> list[tuple[str, dict[str, Any]]]: """Yield (key, meta) pairs, expanding ``aliases`` sub-keys. For every primary entry that has an ``"aliases"`` list, each alias key is emitted as an additional entry with the same metadata dict (minus the ``aliases`` field itself, to keep things tidy). 
""" results: list[tuple[str, dict[str, Any]]] = [] for primary_key, meta in data.items(): if primary_key.startswith("_"): continue if not isinstance(meta, dict): continue version_key = meta.get("version_key", "") if not version_key: continue results.append((primary_key, meta)) # Expand explicit alias variants for extra_key in meta.get("aliases", []): if not isinstance(extra_key, str) or not extra_key: continue if extra_key == primary_key: continue # already emitted # Build a slim copy without the aliases list to avoid recursion slim_meta = {k: v for k, v in meta.items() if k != "aliases"} results.append((extra_key, slim_meta)) return results class AliasParser(BasePlugin, AliasPlugin, FamilyPlugin, NamePlugin): url = None local_path = "data/aliases/aliases.json" def _load_data(self) -> dict[str, Any]: path = Path(__file__).parent.parent / self.local_path return json.loads(path.read_text(encoding="utf-8")) def parse(self) -> list[tuple[str, dict[str, Any]]]: return _iter_entries(self._load_data()) def load_aliases(self) -> dict[str, str]: aliases: dict[str, str] = {} for alias_key, meta in _iter_entries(self._load_data()): version_key = meta.get("version_key", "") if version_key: aliases[alias_key] = version_key return aliases def load_aliases_with_lines( self, ) -> dict[str, tuple[str, int]]: """Load aliases with their source line numbers. Extra keys from ``aliases`` lists are reported at the line of their primary entry (best approximation without per-alias line tracking). 
Returns: dict mapping alias_key -> (version_key, line_number) """ path = Path(__file__).parent.parent / self.local_path content = path.read_text(encoding="utf-8") data: dict[str, Any] = json.loads(content) lines = content.splitlines() result: dict[str, tuple[str, int]] = {} for primary_key, meta in data.items(): if primary_key.startswith("_"): continue if not isinstance(meta, dict): continue version_key = meta.get("version_key", "") if not version_key: continue # Find line of the primary key primary_line = 1 for i, line in enumerate(lines, start=1): if f'"{primary_key}"' in line: primary_line = i break result[primary_key] = (version_key, primary_line) for extra_key in meta.get("aliases", []): if not isinstance(extra_key, str) or not extra_key: continue if extra_key == primary_key: continue result[extra_key] = (version_key, primary_line) return result def load_families(self) -> dict[str, str]: data = self._load_data() overrides: dict[str, str] = {} for meta in data.values(): if not isinstance(meta, dict): continue vk = meta.get("version_key", "") fk = meta.get("family_key", "") if vk and fk: overrides[vk] = fk return overrides def load_names(self) -> dict[str, str]: data = self._load_data() names: dict[str, str] = {} for meta in data.values(): if not isinstance(meta, dict): continue vk = meta.get("version_key", "") nk = meta.get("name_key", "") if vk and nk: names[vk] = nk return names src/licence_normaliser/parsers/creativecommons.py ================================================= src/licence_normaliser/parsers/creativecommons.py """Creative Commons parser - scrapes creativecommons.org for multilingual deed URLs.""" from __future__ import annotations import json import re import urllib.request from html.parser import HTMLParser from pathlib import Path from typing import Any from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin CC_LICENSE_RE = re.compile( r"^(by|by-nc|by-nc-nd|by-nc-sa|by-nd|by-sa|" r"zero|pdmark|devnations|" 
r"nc|nd|sa|sampling|nc-sa|sampling\+|nc-sampling\+|nd-nc)" r"/([\d.]+)" r"(/igo)?" r"(/deed\.\w+)?$", ) VERSION_RE = re.compile(r"^[\d.]+$") def _path_to_license_key(path: str) -> str | None: m = CC_LICENSE_RE.match(path) if not m: return None lic_type, version, igo = m.group(1), m.group(2), m.group(3) prefix_map = { "by": "cc-by", "by-nc": "cc-by-nc", "by-nc-nd": "cc-by-nc-nd", "by-nc-sa": "cc-by-nc-sa", "by-nd": "cc-by-nd", "by-sa": "cc-by-sa", "zero": "cc0", "pdmark": "cc-pdm", "devnations": "cc-devnations", "nc": "cc-nc", "nd": "cc-nd", "sa": "cc-sa", "sampling": "cc-sampling", "nc-sa": "cc-nc-sa", "sampling+": "cc-sampling-plus", "nc-sampling+": "cc-nc-sampling-plus", "nd-nc": "cc-nd-nc", } prefix = prefix_map.get(lic_type) if not prefix: return None suffix = "igo" if igo else "" key = f"{prefix}-{version}" if VERSION_RE.match(version) else prefix if suffix: key = f"{key}-{suffix}" return key.lower() class CCLinkParser(HTMLParser): def __init__(self) -> None: super().__init__() self.in_td = False self.current_cell = "" self.current_row: list[str] = [] self.rows: list[list[str]] = [] def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None: if tag == "td": self.in_td = True self.current_cell = "" elif tag == "a" and self.in_td: href = dict(attrs).get("href") or "" if href: self.current_cell += " AHREF:" + href def handle_endtag(self, tag: str) -> None: if tag == "td": self.in_td = False self.current_row.append(self.current_cell.strip()) elif tag == "tr": if self.current_row: self.rows.append(self.current_row) self.current_row = [] def handle_data(self, data: str) -> None: if self.in_td: self.current_cell += data def _fetch_html(url: str) -> str: req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"}) with urllib.request.urlopen(req, timeout=30) as response: # noqa: S310 return response.read().decode("utf-8") JURISDICTION_CODES = { "au", "at", "be", "br", "ca", "ch", "cl", "cn", "co", "cz", "de", "dk", "ee", "eg", "es", 
"fi", "fr", "gb", "gr", "hr", "hu", "id", "ie", "il", "in", "ir", "is", "it", "jp", "kr", "lt", "lu", "lv", "ma", "mt", "mx", "my", "nl", "no", "nz", "pe", "ph", "pl", "pt", "ro", "rs", "ru", "se", "si", "sk", "th", "tr", "tw", "ua", "ug", "us", "za", "vn", } def _is_international(href: str) -> bool: parts = href.split("/") return not any(p in JURISDICTION_CODES for p in parts[1:]) def _extract_deeds(html: str) -> set[str]: parser = CCLinkParser() parser.feed(html) deeds: set[str] = set() for row in parser.rows: if not row: continue jurisdiction = row[0] if jurisdiction != "English": continue for cell in row[1:]: for part in cell.split(): if part.startswith("AHREF:"): href = part[6:] if href and _is_international(href): deeds.add(href) return deeds def _scrape() -> list[dict[str, str]]: pages = [ "https://creativecommons.org/licenses/list.en", "https://creativecommons.org/publicdomain/list.en", ] all_deeds: set[str] = set() try: for page_url in pages: html = _fetch_html(page_url) all_deeds |= _extract_deeds(html) except Exception: pass entries: list[dict[str, str]] = [] seen_keys: set[str] = set() for href in sorted(all_deeds): lic_key = _path_to_license_key(href) if not lic_key: continue url_path = href.rsplit("/deed.", 1)[0] url = f"https://creativecommons.org/licenses/{url_path}/" if lic_key in seen_keys: continue seen_keys.add(lic_key) entries.append({"license_key": lic_key, "url": url, "path": url_path}) return entries class CreativeCommonsParser(BasePlugin, RegistryPlugin, URLPlugin): id = "creativecommons" url = "https://creativecommons.org/licenses/list.en" local_path = "data/creativecommons/creativecommons.json" def parse(self) -> list[tuple[str, dict[str, Any]]]: path = Path(__file__).parent.parent / self.local_path if not path.exists(): return [] data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8")) return [ ( entry["license_key"], { "url": entry["url"], "name": entry["license_key"], "path": entry["path"], }, ) for entry in data if 
"license_key" in entry ] def load_registry(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path if not path.exists(): return {} data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8")) result: dict[str, str] = {} for entry in data: key = entry.get("license_key", "") if key: result[key.lower().strip()] = key.lower().strip() return result def load_urls(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path if not path.exists(): return {} data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8")) result: dict[str, str] = {} for entry in data: key = entry.get("license_key", "") if not key: continue canonical = key.lower().strip() raw_url = entry.get("url", "") if not raw_url: continue clean = raw_url.strip().lower().rstrip("/") if clean.startswith("http://"): clean = "https://" + clean[7:] result[clean] = canonical return result @classmethod def refresh(cls, force: bool = False) -> bool: target = Path(__file__).parent.parent / cls.local_path if target.exists() and not force: return True try: data = _scrape() target.parent.mkdir(parents=True, exist_ok=True) target.write_text( json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8" ) return True except Exception: return False src/licence_normaliser/parsers/opendefinition.py ================================================ src/licence_normaliser/parsers/opendefinition.py """OpenDefinition parser - loads opendefinition_licenses_all.json from package data.""" from __future__ import annotations import json from pathlib import Path from typing import Any from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ("OpenDefinitionParser",) class OpenDefinitionParser(BasePlugin, RegistryPlugin, URLPlugin): id = "opendefinition" url = "https://licenses.opendefinition.org/licenses/groups/all.json" local_path = 
"data/opendefinition/opendefinition.json" def parse(self) -> list[tuple[str, dict[str, Any]]]: path = Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) results: list[tuple[str, dict[str, Any]]] = [] for entry in data.values(): if not isinstance(entry, dict): continue lid = entry.get("id", "") url = entry.get("url", "") results.append((lid, {"url": url, "title": entry.get("title", "")})) return results def load_registry(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) result: dict[str, str] = {} for entry in data.values(): if not isinstance(entry, dict): continue lid = entry.get("id", "") if lid: result[lid.lower().strip()] = lid.lower().strip() return result def load_urls(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) result: dict[str, str] = {} for entry in data.values(): if not isinstance(entry, dict): continue lid = entry.get("id", "") if not lid: continue canonical = lid.lower().strip() raw_url = entry.get("url", "") if not raw_url: continue clean = raw_url.strip().lower().rstrip("/") if clean.startswith("http://"): clean = "https://" + clean[7:] result[clean] = canonical return result src/licence_normaliser/parsers/osi.py ===================================== src/licence_normaliser/parsers/osi.py """OSI parser - loads osi.json from package data.""" from __future__ import annotations import json from pathlib import Path from typing import Any from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ("OSIParser",) class OSIParser(BasePlugin, RegistryPlugin, URLPlugin): id = "osi" url = "https://opensource.org/api/license" local_path = "data/osi/osi.json" def parse(self) -> list[tuple[str, dict[str, Any]]]: path = 
Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) results: list[tuple[str, dict[str, Any]]] = [] if not isinstance(data, list): return results for entry in data: if not isinstance(entry, dict): continue key = entry.get("id", "") if not key: continue links = entry.get("_links", {}) html_link = links.get("html", {}) url = html_link.get("href", "") if isinstance(html_link, dict) else "" results.append( ( key, { "url": url, "name": entry.get("name", ""), "spdx_id": entry.get("spdx_id", ""), }, ) ) return results def load_registry(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) result: dict[str, str] = {} if not isinstance(data, list): return result for entry in data: if not isinstance(entry, dict): continue key = entry.get("id", "").strip() if key: result[key.lower()] = key.lower() return result def load_urls(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) result: dict[str, str] = {} if not isinstance(data, list): return result for entry in data: if not isinstance(entry, dict): continue key = entry.get("id", "").strip() if not key: continue canonical = key.lower() links = entry.get("_links", {}) html_link = links.get("html", {}) raw_url = html_link.get("href", "") if isinstance(html_link, dict) else "" if not raw_url: continue clean = raw_url.strip().lower().rstrip("/") if clean.startswith("http://"): clean = "https://" + clean[7:] result[clean] = canonical return result src/licence_normaliser/parsers/prose.py ======================================= src/licence_normaliser/parsers/prose.py """Prose pattern parser - loads prose_patterns.json and compiles regex patterns.""" from __future__ import annotations import json import re from pathlib import Path from typing import Any from licence_normaliser.plugins import BasePlugin, ProsePlugin __author__ = "Artur 
Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ("ProseParser",) _COMPILED_PATTERNS: list[tuple[re.Pattern[str], str]] = [] class ProseParser(BasePlugin, ProsePlugin): is_registry_entry = False url = None local_path = "data/prose/prose_patterns.json" def parse(self) -> list[tuple[str, dict[str, Any]]]: path = Path(__file__).parent.parent / self.local_path data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8")) global _COMPILED_PATTERNS _COMPILED_PATTERNS = [] results: list[tuple[str, dict[str, Any]]] = [] for entry in data: pattern_str = entry.get("pattern", "") version_key = entry.get("version_key", "") name_key = entry.get("name_key", "") family_key = entry.get("family_key", "") if pattern_str and version_key: compiled = re.compile(pattern_str, re.IGNORECASE) _COMPILED_PATTERNS.append((compiled, version_key)) results.append( ( pattern_str, { "pattern": compiled, "version_key": version_key, "name_key": name_key, "family_key": family_key, }, ) ) return results def load_prose(self) -> list[tuple[re.Pattern[str], str]]: global _COMPILED_PATTERNS _COMPILED_PATTERNS = [] path = Path(__file__).parent.parent / self.local_path data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8")) for entry in data: pattern_str = entry.get("pattern", "") version_key = entry.get("version_key", "") if pattern_str and version_key: compiled = re.compile(pattern_str, re.IGNORECASE) _COMPILED_PATTERNS.append((compiled, version_key)) return _COMPILED_PATTERNS def load_prose_with_lines(self) -> list[tuple[re.Pattern[str], str, int]]: """Load prose patterns with their source line numbers. 
Returns: list of (compiled_pattern, version_key, line_number) """ path = Path(__file__).parent.parent / self.local_path content = path.read_text(encoding="utf-8") data: list[dict[str, str]] = json.loads(content) lines = content.splitlines() result: list[tuple[re.Pattern[str], str, int]] = [] for entry in data: pattern_str = entry.get("pattern", "") version_key = entry.get("version_key", "") if pattern_str and version_key: compiled = re.compile(pattern_str, re.IGNORECASE) serialized = json.dumps(pattern_str) line_num = 1 for i, line in enumerate(lines, start=1): if '"pattern"' in line and serialized[:30] in line: line_num = i break result.append((compiled, version_key, line_num)) return result def get_prose_patterns() -> list[tuple[re.Pattern[str], str]]: """Legacy helper: return the compiled prose patterns.""" return _COMPILED_PATTERNS src/licence_normaliser/parsers/publisher.py =========================================== src/licence_normaliser/parsers/publisher.py """Publisher parser - loads publishers.json with URLs and shorthand aliases.""" from __future__ import annotations import json from pathlib import Path from typing import Any from licence_normaliser.plugins import AliasPlugin, BasePlugin, URLPlugin __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ("PublisherParser",) class PublisherParser(BasePlugin, AliasPlugin, URLPlugin): url = None local_path = "data/publishers/publishers.json" def parse(self) -> list[tuple[str, dict[str, Any]]]: path = Path(__file__).parent.parent / self.local_path data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8")) results: list[tuple[str, dict[str, Any]]] = [] urls: dict[str, dict[str, str]] = data.get("urls", {}) for url, meta in urls.items(): if isinstance(meta, dict): results.append((url, meta)) return results def load_aliases(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path data: dict[str, Any] = 
json.loads(path.read_text(encoding="utf-8")) aliases: dict[str, str] = data.get("shorthand_aliases", {}) return dict(aliases) def load_aliases_with_lines(self) -> dict[str, tuple[str, int]]: """Load shorthand aliases with their source line numbers.""" path = Path(__file__).parent.parent / self.local_path content = path.read_text(encoding="utf-8") data: dict[str, Any] = json.loads(content) lines = content.splitlines() result: dict[str, tuple[str, int]] = {} for alias_key, version_key in data.get("shorthand_aliases", {}).items(): for i, line in enumerate(lines, start=1): if f'"{alias_key}"' in line: result[alias_key] = (version_key, i) break return result def load_urls(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8")) result: dict[str, str] = {} for url, meta in data.get("urls", {}).items(): if not isinstance(meta, dict): continue vk = meta.get("version_key", "") if not vk: continue clean = url.strip().lower().rstrip("/") if clean.startswith("http://"): clean = "https://" + clean[7:] result[clean] = vk return result def load_urls_with_lines(self) -> dict[str, tuple[str, int]]: """Load URLs with their source line numbers.""" path = Path(__file__).parent.parent / self.local_path content = path.read_text(encoding="utf-8") data: dict[str, Any] = json.loads(content) lines = content.splitlines() result: dict[str, tuple[str, int]] = {} for url, meta in data.get("urls", {}).items(): if not isinstance(meta, dict): continue vk = meta.get("version_key", "") if not vk: continue clean = url.strip().lower().rstrip("/") if clean.startswith("http://"): clean = "https://" + clean[7:] for i, line in enumerate(lines, start=1): if f'"{url}"' in line or f'"{clean}"' in line: result[clean] = (vk, i) break return result src/licence_normaliser/parsers/scancode_licensedb.py ==================================================== src/licence_normaliser/parsers/scancode_licensedb.py 
"""ScanCode-licensedb parser - loads scancode_licensedb.json from package data.""" from __future__ import annotations import json from pathlib import Path from typing import Any from licence_normaliser.plugins import BasePlugin, RegistryPlugin __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ("ScanCodeLicenseDBParser",) class ScanCodeLicenseDBParser(BasePlugin, RegistryPlugin): id = "scancode-licensedb" url = "https://scancode-licensedb.aboutcode.org/index.json" local_path = "data/scancode_licensedb/scancode_licensedb.json" def parse(self) -> list[tuple[str, dict[str, Any]]]: path = Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) results: list[tuple[str, dict[str, Any]]] = [] if not isinstance(data, list): return results for entry in data: if not isinstance(entry, dict): continue key = entry.get("license_key", "") if not key: continue if key.lower() == "unknown": continue spdx_key = entry.get("spdx_license_key") category = entry.get("category", "") results.append( ( key, { "url": "", "name": key, "category": category, "spdx_license_key": spdx_key if spdx_key else "", }, ) ) return results def load_registry(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) result: dict[str, str] = {} if not isinstance(data, list): return result for entry in data: if not isinstance(entry, dict): continue key = entry.get("license_key", "") if key and key.lower() != "unknown": result[key.lower().strip()] = key.lower().strip() return result src/licence_normaliser/parsers/spdx.py ====================================== src/licence_normaliser/parsers/spdx.py """SPDX parser - loads spdx-licenses.json from package data.""" from __future__ import annotations import json from pathlib import Path from typing import Any from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin __author__ = 
"Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ("SPDXParser",) class SPDXParser(BasePlugin, RegistryPlugin, URLPlugin): id = "spdx" url = "https://raw.githubusercontent.com/spdx/license-list-data/main/json/licenses.json" local_path = "data/spdx/spdx.json" def parse(self) -> list[tuple[str, dict[str, Any]]]: path = Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) results: list[tuple[str, dict[str, Any]]] = [] for entry in data.get("licenses", []): if not isinstance(entry, dict): continue lid = entry.get("licenseId", "") urls = entry.get("seeAlso", []) url = urls[0] if urls else "" results.append((lid, {"url": url, "name": entry.get("name", "")})) return results def load_registry(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) result: dict[str, str] = {} for entry in data.get("licenses", []): if not isinstance(entry, dict): continue lid = entry.get("licenseId", "") if lid: result[lid.lower().strip()] = lid.lower().strip() return result def load_urls(self) -> dict[str, str]: path = Path(__file__).parent.parent / self.local_path data = json.loads(path.read_text(encoding="utf-8")) result: dict[str, str] = {} for entry in data.get("licenses", []): if not isinstance(entry, dict): continue lid = entry.get("licenseId", "") if not lid: continue canonical = lid.lower().strip() for raw_url in entry.get("seeAlso", []): if not raw_url: continue clean = raw_url.strip().lower().rstrip("/") if clean.startswith("http://"): clean = "https://" + clean[7:] result[clean] = canonical return result src/licence_normaliser/plugins.py ================================= src/licence_normaliser/plugins.py """Simple plugin interface definitions. Each plugin is a callable that returns a dict or list of tuples. Plugins are passed as CLASSES (not instances) - they're instantiated lazily. 
""" from __future__ import annotations import json import logging import re import urllib.error import urllib.request from pathlib import Path __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" __all__ = ( "AliasPlugin", "BasePlugin", "FamilyPlugin", "NamePlugin", "ProsePlugin", "RegistryPlugin", "URLPlugin", ) class BasePlugin: """Base class for all plugins with refresh capability.""" url: str | None = None local_path: str = "" @classmethod def refresh(cls, force: bool = False) -> bool: """Fetch fresh data from ``cls.url`` and write to ``cls.local_path``. The local path is resolved relative to the package root (``src/licence_normaliser/``). If ``cls.url`` is None, this is a local-only parser with no external source and the operation succeeds without fetching. Returns True on success, False on failure. """ if not cls.local_path: return False target = Path(__file__).parent / cls.local_path if target.exists() and not force: return True if cls.url is None: return True try: with urllib.request.urlopen(cls.url, timeout=30) as response: # noqa: S310 raw_bytes = response.read() json.loads(raw_bytes.decode("utf-8")) target.parent.mkdir(parents=True, exist_ok=True) target.write_bytes(raw_bytes) return True except urllib.error.URLError as exc: logging.warning( "refresh(%s): URLError fetching %s - %s", cls.__name__, cls.url, exc ) return False except urllib.error.HTTPError as exc: logging.warning( "refresh(%s): HTTPError %s fetching %s", cls.__name__, exc.code, cls.url ) return False except json.JSONDecodeError as exc: logging.error( "refresh(%s): invalid JSON from %s - %s", cls.__name__, cls.url, exc ) return False except OSError as exc: logging.error( "refresh(%s): OSError writing %s - %s", cls.__name__, target, exc ) return False class RegistryPlugin: """Returns key -> canonical_key mappings.""" def load_registry(self) -> dict[str, str]: raise NotImplementedError class URLPlugin: """Returns cleaned_url -> version_key mappings.""" 
def load_urls(self) -> dict[str, str]: raise NotImplementedError class AliasPlugin: """Returns alias_string -> version_key mappings.""" def load_aliases(self) -> dict[str, str]: raise NotImplementedError class FamilyPlugin: """Returns version_key -> family_key mappings.""" def load_families(self) -> dict[str, str]: raise NotImplementedError class NamePlugin: """Returns version_key -> name_key mappings.""" def load_names(self) -> dict[str, str]: raise NotImplementedError class ProsePlugin: """Returns list of (compiled_pattern, version_key) for prose matching.""" def load_prose(self) -> list[tuple[re.Pattern[str], str]]: raise NotImplementedError src/licence_normaliser/tests/__init__.py ======================================== src/licence_normaliser/tests/__init__.py """Tests for licence_normaliser.""" __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" src/licence_normaliser/tests/conftest.py ======================================== src/licence_normaliser/tests/conftest.py """Shared fixtures for licence_normaliser tests.""" import pytest __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" @pytest.fixture() def mit_raw() -> str: return "MIT" @pytest.fixture() def cc_by_nc_nd_4_raw() -> str: return "CC BY-NC-ND 4.0" @pytest.fixture() def batch_raw() -> list[str]: return ["MIT", "Apache-2.0", "CC BY 4.0"] src/licence_normaliser/tests/test_aliases.py ============================================ src/licence_normaliser/tests/test_aliases.py """Tests for AliasParser - non-CC aliases (Apache, MIT, BSD, GPL, etc.).""" from licence_normaliser import normalise_license class TestNonCCAliases: def test_apache_shorthand(self): v = normalise_license("apache") assert v.key == "apache-2.0" assert v.family.key == "osi" def test_apache_license(self): v = normalise_license("apache license") assert v.key == "apache-2.0" assert v.family.key == "osi" def test_apache_2(self): v = normalise_license("apache 2") 
assert v.key == "apache-2.0" assert v.family.key == "osi" def test_apache_2_0(self): v = normalise_license("apache 2.0") assert v.key == "apache-2.0" assert v.family.key == "osi" def test_mit_license(self): v = normalise_license("mit license") assert v.key == "mit" assert v.family.key == "osi" def test_the_mit_license(self): v = normalise_license("the mit license") assert v.key == "mit" assert v.family.key == "osi" def test_bsd_shorthand(self): v = normalise_license("bsd") assert v.key == "bsd-3-clause" assert v.family.key == "osi" def test_bsd_license(self): v = normalise_license("bsd license") assert v.key == "bsd-3-clause" assert v.family.key == "osi" def test_mozilla(self): v = normalise_license("mozilla") assert v.key == "mpl-2.0" assert v.family.key == "osi" def test_isc_license(self): v = normalise_license("isc license") assert v.key == "isc" assert v.family.key == "osi" def test_gpl_shorthand(self): v = normalise_license("gpl") assert v.key == "gpl-3.0" assert v.family.key == "copyleft" def test_gnu_gpl(self): v = normalise_license("gnu gpl") assert v.key == "gpl-3.0" assert v.family.key == "copyleft" def test_gnu_gpl_v2(self): v = normalise_license("gnu gpl v2") assert v.key == "gpl-2.0" assert v.family.key == "copyleft" def test_gpl_3_0_or_later(self): v = normalise_license("gpl-3.0+") assert v.key == "gpl-3.0" assert v.family.key == "copyleft" def test_gpl_2_0_or_later(self): v = normalise_license("gpl-2.0+") assert v.key == "gpl-2.0" assert v.family.key == "copyleft" def test_agpl_shorthand(self): v = normalise_license("agpl") assert v.key == "agpl-3.0" assert v.family.key == "copyleft" def test_agpl_3_0_or_later(self): v = normalise_license("agpl-3.0+") assert v.key == "agpl-3.0" assert v.family.key == "copyleft" def test_lgpl_shorthand(self): v = normalise_license("lgpl") assert v.key == "lgpl-3.0" assert v.family.key == "copyleft" def test_lgpl_2_1_or_later(self): v = normalise_license("lgpl-2.1+") assert v.key == "lgpl-2.1" assert v.family.key == 
"copyleft" def test_lgpl_3_0_or_later(self): v = normalise_license("lgpl-3.0+") assert v.key == "lgpl-3.0" assert v.family.key == "copyleft" def test_unlicense(self): v = normalise_license("unlicense") assert v.key == "unlicense" assert v.family.key == "osi" def test_wtfpl(self): v = normalise_license("wtfpl") assert v.key == "wtfpl" assert v.family.key == "osi" def test_zlib(self): v = normalise_license("zlib") assert v.key == "zlib" assert v.family.key == "osi" def test_open_database_license(self): v = normalise_license("open database license") assert v.key == "odbl" assert v.family.key == "open-data" def test_public_domain(self): v = normalise_license("public domain") assert v.key == "public-domain" assert v.family.key == "public-domain" def test_pd_alias(self): v = normalise_license("pd") assert v.key == "public-domain" assert v.family.key == "public-domain" src/licence_normaliser/tests/test_cache.py ========================================== src/licence_normaliser/tests/test_cache.py """Tests for _cache.py - thread-safe default normaliser singleton.""" from __future__ import annotations import threading from concurrent.futures import ThreadPoolExecutor from licence_normaliser._cache import ( _DefaultNormaliser, get_registry_keys, normalise_license, normalise_licenses, ) from licence_normaliser._normaliser import LicenseNormaliser class TestDefaultNormaliserSingleton: def test_singleton_instance_reused(self) -> None: d1 = _DefaultNormaliser() d2 = _DefaultNormaliser() assert d1.get() is d2.get() def test_get_returns_licence_normaliser(self) -> None: d = _DefaultNormaliser() instance = d.get() assert isinstance(instance, LicenseNormaliser) def test_thread_safety_same_instance(self) -> None: results: list[object | None] = [None] * 20 errors: list[BaseException | None] = [None] * 20 def get_instance(idx: int) -> None: try: d = _DefaultNormaliser() results[idx] = d.get() except BaseException as e: # noqa: BLE001 errors[idx] = e threads = 
[threading.Thread(target=get_instance, args=(i,)) for i in range(20)] for t in threads: t.start() for t in threads: t.join() assert all(e is None for e in errors) assert results[0] is not None assert all(r is results[0] for r in results if r is not None) def test_concurrent_normalise_license(self) -> None: licenses = ["MIT", "Apache-2.0", "CC BY 4.0", "GPL-3.0", "BSD-3-Clause"] def normalise(lic: str) -> str: v = normalise_license(lic) return v.key with ThreadPoolExecutor(max_workers=10) as executor: futures = [executor.submit(normalise, lic) for lic in licenses * 4] results = [f.result(timeout=5) for f in futures] assert len(results) == len(licenses) * 4 assert set(results) == { "mit", "apache-2.0", "cc-by-4.0", "gpl-3.0", "bsd-3-clause", } class TestModuleLevelAPI: def test_normalise_license_returns_license_version(self) -> None: v = normalise_license("MIT") assert str(v) == "mit" def test_normalise_licenses_returns_list(self) -> None: results = normalise_licenses(["MIT", "Apache-2.0"]) assert len(results) == 2 assert all(str(r) in ("mit", "apache-2.0") for r in results) def test_get_registry_keys_returns_set_of_strings(self) -> None: keys = get_registry_keys() assert isinstance(keys, set) assert len(keys) > 1000 assert "mit" in keys assert "apache-2.0" in keys src/licence_normaliser/tests/test_cli.py ======================================== src/licence_normaliser/tests/test_cli.py """Tests for licence_normaliser CLI - includes new --strict flag.""" from unittest.mock import patch import pytest from licence_normaliser.cli._main import main __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" class TestNormaliseCommand: def test_normalise_mit(self, capsys): with patch("sys.argv", ["licence-normaliser", "normalise", "MIT"]): with pytest.raises(SystemExit) as exc_info: main() assert exc_info.value.code == 0 assert capsys.readouterr().out.strip() == "mit" def test_normalise_full(self, capsys): with patch( "sys.argv", 
["licence-normaliser", "normalise", "--full", "CC BY 4.0"] ): with pytest.raises(SystemExit) as exc_info: main() assert exc_info.value.code == 0 out = capsys.readouterr().out assert "Key: cc-by-4.0" in out assert "License: cc-by" in out assert "Family: cc" in out def test_normalise_cc_url(self, capsys): with patch( "sys.argv", [ "licence-normaliser", "normalise", "http://creativecommons.org/licenses/by/4.0/", ], ): with pytest.raises(SystemExit) as exc_info: main() assert exc_info.value.code == 0 assert capsys.readouterr().out.strip() == "cc-by-4.0" def test_normalise_unknown(self, capsys): with patch( "sys.argv", ["licence-normaliser", "normalise", "totally-unknown-xyz"] ): with pytest.raises(SystemExit) as exc_info: main() assert exc_info.value.code == 0 assert "totally-unknown-xyz" in capsys.readouterr().out def test_normalise_strict_known(self, capsys): with patch("sys.argv", ["licence-normaliser", "normalise", "--strict", "MIT"]): with pytest.raises(SystemExit) as exc_info: main() assert exc_info.value.code == 0 assert capsys.readouterr().out.strip() == "mit" def test_normalise_strict_unknown_exits_1(self, capsys): with patch( "sys.argv", ["licence-normaliser", "normalise", "--strict", "totally-unknown-xyz-9999"], ): with pytest.raises(SystemExit) as exc_info: main() assert exc_info.value.code == 1 assert capsys.readouterr().err # error message on stderr class TestBatchCommand: def test_batch_basic(self, capsys): with patch( "sys.argv", ["licence-normaliser", "batch", "MIT", "Apache-2.0", "CC BY 4.0"], ): with pytest.raises(SystemExit) as exc_info: main() assert exc_info.value.code == 0 out = capsys.readouterr().out assert "MIT: mit" in out assert "Apache-2.0: apache-2.0" in out assert "CC BY 4.0: cc-by-4.0" in out def test_batch_strict_all_known(self, capsys): with patch( "sys.argv", ["licence-normaliser", "batch", "--strict", "MIT", "GPL-3.0"] ): with pytest.raises(SystemExit) as exc_info: main() assert exc_info.value.code == 0 def 
test_batch_strict_with_unknown_exits_1(self, capsys): with patch( "sys.argv", ["licence-normaliser", "batch", "--strict", "MIT", "no-such-license-xyz"], ): with pytest.raises(SystemExit) as exc_info: main() assert exc_info.value.code == 1 class TestVersionFlag: def test_version_flag(self, capsys): with patch("sys.argv", ["licence-normaliser", "--version"]): with pytest.raises(SystemExit) as exc_info: main() assert exc_info.value.code == 0 assert "licence-normaliser" in capsys.readouterr().out src/licence_normaliser/tests/test_core.py ========================================= src/licence_normaliser/tests/test_core.py """End-to-end pipeline tests via the public API.""" from licence_normaliser import normalise_license, normalise_licenses __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" class TestDirectLookup: def test_mit(self): v = normalise_license("mit") assert v.key == "mit" assert v.family.key == "osi" def test_apache(self): v = normalise_license("apache-2.0") assert v.key == "apache-2.0" assert v.family.key == "osi" def test_cc_by_4_0(self): v = normalise_license("cc-by-4.0") assert v.key == "cc-by-4.0" assert v.family.key == "cc" def test_cc_by_nc_nd_4_0(self): v = normalise_license("cc-by-nc-nd-4.0") assert v.key == "cc-by-nc-nd-4.0" assert v.family.key == "cc" def test_cc0_1_0(self): v = normalise_license("cc0-1.0") assert v.key == "cc0-1.0" assert v.family.key == "cc0" def test_gpl_3_0(self): v = normalise_license("gpl-3.0") assert v.key == "gpl-3.0" assert v.family.key == "copyleft" def test_gpl_2_0_only(self): v = normalise_license("gpl-2.0-only") assert v.key == "gpl-2.0-only" assert v.family.key == "copyleft" def test_lgpl_2_1(self): v = normalise_license("lgpl-2.1") assert v.key == "lgpl-2.1" assert v.family.key == "copyleft" def test_agpl_3_0(self): v = normalise_license("agpl-3.0") assert v.key == "agpl-3.0" assert v.family.key == "copyleft" def test_bsd_3_clause(self): v = normalise_license("bsd-3-clause") 
assert v.key == "bsd-3-clause" assert v.family.key == "osi" def test_isc(self): v = normalise_license("isc") assert v.key == "isc" assert v.family.key == "osi" def test_mpl_2_0(self): v = normalise_license("mpl-2.0") assert v.key == "mpl-2.0" assert v.family.key == "osi" def test_unlicense(self): v = normalise_license("unlicense") assert v.key == "unlicense" assert v.family.key == "osi" def test_wtfpl(self): v = normalise_license("wtfpl") assert v.key == "wtfpl" assert v.family.key == "osi" def test_zlib(self): v = normalise_license("zlib") assert v.key == "zlib" assert v.family.key == "osi" def test_odbl_1_0(self): v = normalise_license("odbl-1.0") assert v.key == "odbl-1.0" assert v.family.key == "open-data" def test_pddl_1_0(self): v = normalise_license("pddl-1.0") assert v.key == "pddl-1.0" assert v.family.key == "data" def test_odc_by_1_0(self): v = normalise_license("odc-by-1.0") assert v.key == "odc-by-1.0" assert v.family.key == "open-data" def test_unknown(self): v = normalise_license("unknown") assert v.key == "unknown" assert v.family.key == "unknown" def test_case_insensitive(self): v = normalise_license("MIT") assert v.key == "mit" v = normalise_license("Apache-2.0") assert v.key == "apache-2.0" class TestBuiltinAliases: def test_cc_by(self): assert normalise_license("CC BY").key == "cc-by" def test_cc_by_4_0(self): assert normalise_license("CC BY 4.0").key == "cc-by-4.0" def test_cc_by_nc_nd_4_0(self): assert normalise_license("CC BY-NC-ND 4.0").key == "cc-by-nc-nd-4.0" def test_cc_by_nc_sa_4_0(self): assert normalise_license("CC BY-NC-SA 4.0").key == "cc-by-nc-sa-4.0" def test_cc0_1_0(self): assert normalise_license("CC0 1.0").key == "cc0-1.0" def test_public_domain(self): assert normalise_license("public domain").key == "public-domain" class TestUrlLookup: def test_cc_by_https(self): v = normalise_license("https://creativecommons.org/licenses/by/4.0/") assert v.key == "cc-by-4.0" def test_cc_by_http(self): v = 
normalise_license("http://creativecommons.org/licenses/by/4.0/") assert v.key == "cc-by-4.0" def test_cc_by_no_trailing_slash(self): v = normalise_license("https://creativecommons.org/licenses/by/4.0") assert v.key == "cc-by-4.0" def test_mit_url(self): v = normalise_license("https://opensource.org/licenses/MIT") assert v.key == "mit" class TestFamilyInference: def test_cc_family(self): v = normalise_license("cc-by-4.0") assert v.family.key == "cc" def test_cc0_family(self): v = normalise_license("cc0-1.0") assert v.family.key == "cc0" def test_copyleft_family(self): assert normalise_license("gpl-3.0").family.key == "copyleft" assert normalise_license("agpl-3.0").family.key == "copyleft" assert normalise_license("lgpl-2.1").family.key == "copyleft" def test_osi_family(self): assert normalise_license("mit").family.key == "osi" assert normalise_license("apache-2.0").family.key == "osi" assert normalise_license("bsd-3-clause").family.key == "osi" def test_data_family(self): assert normalise_license("pddl-1.0").family.key == "data" class TestNameInference: def test_cc_name_strips_version(self): assert normalise_license("cc-by-4.0").license.key == "cc-by" assert normalise_license("cc-by-nc-nd-4.0").license.key == "cc-by-nc-nd" assert normalise_license("cc-by-sa-3.0").license.key == "cc-by-sa" assert normalise_license("cc0-1.0").license.key == "cc0" assert normalise_license("cc-by-nc-sa-4.0").license.key == "cc-by-nc-sa" def test_non_cc_keeps_key(self): assert normalise_license("mit").license.key == "mit" assert normalise_license("gpl-3.0").license.key == "gpl-3" class TestHierarchyNavigation: def test_version_license_family_chain(self): v = normalise_license("CC BY-NC-ND 4.0") assert v.key == "cc-by-nc-nd-4.0" assert v.license.key == "cc-by-nc-nd" assert v.license.family.key == "cc" assert v.family.key == "cc" def test_str_representations(self): v = normalise_license("CC BY-NC-ND 4.0") assert str(v) == "cc-by-nc-nd-4.0" assert str(v.license) == "cc-by-nc-nd" assert 
str(v.family) == "cc" class TestFallback: def test_unknown_string(self): v = normalise_license("some-totally-unknown-license-xyz") assert v.key == "some-totally-unknown-license-xyz" assert v.family.key == "unknown" def test_empty_string(self): v = normalise_license("") assert v.key == "unknown" def test_whitespace_only(self): v = normalise_license(" ") assert v.key == "unknown" class TestBatchNormalisation: def test_basic_batch(self): results = normalise_licenses(["MIT", "Apache-2.0", "CC BY 4.0"]) assert [r.key for r in results] == ["mit", "apache-2.0", "cc-by-4.0"] def test_batch_preserves_order(self): raw = ["GPL-3.0", "MIT", "CC BY 4.0", "Apache-2.0"] expected = ["gpl-3.0", "mit", "cc-by-4.0", "apache-2.0"] assert [r.key for r in normalise_licenses(raw)] == expected def test_batch_accepts_generator(self): results = normalise_licenses(x for x in ["MIT", "ISC"]) assert results[0].key == "mit" def test_batch_empty(self): assert normalise_licenses([]) == [] src/licence_normaliser/tests/test_exceptions.py =============================================== src/licence_normaliser/tests/test_exceptions.py """Tests for strict mode and the public exception hierarchy.""" import pytest from licence_normaliser import normalise_license, normalise_licenses from licence_normaliser.exceptions import ( LicenseNormaliserError, LicenseNotFoundError, ) __author__ = "Artur Barseghyan " __copyright__ = "2026 Artur Barseghyan" __license__ = "MIT" class TestLicenseNotFoundError: def test_is_subclass_of_base(self): assert issubclass(LicenseNotFoundError, LicenseNormaliserError) def test_is_subclass_of_exception(self): assert issubclass(LicenseNotFoundError, Exception) def test_attributes(self): exc = LicenseNotFoundError("My License", "my license") assert exc.raw == "My License" assert exc.cleaned == "my license" def test_str_contains_raw(self): exc = LicenseNotFoundError("My License", "my license") assert "My License" in str(exc) def test_str_mentions_strict_false(self): exc = 
LicenseNotFoundError("x", "x")
        assert "strict=False" in str(exc)


class TestStrictModeNormalise:
    def test_known_license_no_raise(self):
        # Known licenses must not raise in strict mode
        v = normalise_license("MIT", strict=True)
        assert v.key == "mit"

    def test_unknown_raises_license_not_found(self):
        with pytest.raises(LicenseNotFoundError) as exc_info:
            normalise_license("totally-unknown-xyz-9999", strict=True)
        assert exc_info.value.raw == "totally-unknown-xyz-9999"
        assert exc_info.value.cleaned == "totally-unknown-xyz-9999"

    def test_empty_string_raises(self):
        with pytest.raises(LicenseNotFoundError):
            normalise_license("", strict=True)

    def test_whitespace_only_raises(self):
        with pytest.raises(LicenseNotFoundError):
            normalise_license(" ", strict=True)

    def test_cc_url_known_no_raise(self):
        v = normalise_license(
            "https://creativecommons.org/licenses/by/4.0/", strict=True
        )
        assert v.key == "cc-by-4.0"

    def test_strict_false_unknown_returns_unknown(self):
        # Default (strict=False): silently returns unknown
        v = normalise_license("no-such-license-xyzzy", strict=False)
        assert v.family.key == "unknown"

    def test_strict_default_is_false(self):
        # Calling without strict kwarg should not raise
        v = normalise_license("no-such-license-xyzzy")
        assert v.family.key == "unknown"


class TestStrictModeBatch:
    def test_all_known_no_raise(self):
        results = normalise_licenses(["MIT", "Apache-2.0"], strict=True)
        assert len(results) == 2
        assert results[0].key == "mit"
        assert results[1].key == "apache-2.0"

    def test_one_unknown_raises(self):
        with pytest.raises(LicenseNotFoundError):
            normalise_licenses(["MIT", "no-such-license-xyz"], strict=True)

    def test_non_strict_batch_with_unknown(self):
        results = normalise_licenses(["MIT", "no-such-license-xyz"], strict=False)
        assert results[0].key == "mit"
        assert results[1].family.key == "unknown"

    def test_empty_batch_strict(self):
        # Empty input should not raise even in strict mode
        assert normalise_licenses([], strict=True) == []
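
The strict-mode contract that ``test_exceptions.py`` checks can be sketched in isolation. The block below is a toy stand-in under stated assumptions, not the library's implementation: ``toy_normalise`` and ``_TOY_REGISTRY`` are hypothetical names, the two exception classes are local re-creations that only mirror what the tests assert (``raw`` and ``cleaned`` attributes, a message mentioning ``strict=False``), and the cleaning step (strip plus lower-case) is an assumption inferred from the test inputs.

```python
"""Toy illustration of the strict-mode contract; all names are hypothetical."""


class LicenseNormaliserError(Exception):
    """Base class of the toy exception hierarchy."""


class LicenseNotFoundError(LicenseNormaliserError):
    """Raised in strict mode when an input cannot be resolved."""

    def __init__(self, raw: str, cleaned: str) -> None:
        self.raw = raw
        self.cleaned = cleaned
        # The message wording is an assumption; the tests only require that
        # it contains the raw input and mentions strict=False.
        super().__init__(
            f"License not found: {raw!r} (cleaned: {cleaned!r}). "
            "Pass strict=False to get an 'unknown' result instead."
        )


# Assumed, tiny lookup table: cleaned input -> version key.
_TOY_REGISTRY = {"mit": "mit", "apache-2.0": "apache-2.0"}


def toy_normalise(raw: str, strict: bool = False) -> str:
    """Return a version key, or raise in strict mode when nothing matches."""
    cleaned = raw.strip().lower()
    if cleaned in _TOY_REGISTRY:
        return _TOY_REGISTRY[cleaned]
    if strict:
        raise LicenseNotFoundError(raw, cleaned)
    # Non-strict fallback: echo the cleaned input, or "unknown" when empty.
    return cleaned or "unknown"
```

In this sketch, known inputs resolve the same way in both modes; only unresolved inputs (including empty or whitespace-only strings) differ, raising in strict mode and falling back otherwise.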
src/licence_normaliser/tests/test_integration.py ================================================ src/licence_normaliser/tests/test_integration.py """Comprehensive integration tests covering the full license matrix. Each tuple: (input_string, expected_version_key, expected_license_key, expected_family_key) """ import pytest from licence_normaliser import ( LicenseNormalisationError, LicenseNotFoundError, LicenseVersion, normalise_license, normalise_licenses, ) LICENSE_MATRIX = [ # raw,expected_key,expected_license,expected_family # === OSI-approved licenses === ("mit", "mit", "mit", "osi"), ("MIT", "mit", "mit", "osi"), (" mit ", "mit", "mit", "osi"), ("apache-2.0", "apache-2.0", "apache", "osi"), ("Apache-2.0", "apache-2.0", "apache", "osi"), ("Apache 2.0", "apache-2.0", "apache", "osi"), ("Apache License 2.0", "apache-2.0", "apache", "osi"), ( "BSD 3-Clause", "bsd-3-clause", "bsd-3-clause", "osi", ), # Resolves to bsd-3-clause/osi, matches SPDX and alias entries ("bsd-3-clause", "bsd-3-clause", "bsd-3-clause", "osi"), ("BSD License", "bsd-3-clause", "bsd-3-clause", "osi"), ("MPL-2.0", "mpl-2.0", "mpl", "osi"), ("mpl-2.0", "mpl-2.0", "mpl", "osi"), ( "Mozilla Public License 2.0", "mpl-2.0", "mpl", "osi", ), # Canonical full name of MPL-2.0, matches alias entry ("ISC", "isc", "isc", "osi"), ("isc", "isc", "isc", "osi"), ("ISC License", "isc", "isc", "osi"), ("Unlicense", "unlicense", "unlicense", "osi"), ("unlicense", "unlicense", "unlicense", "osi"), ("WTFPL", "wtfpl", "wtfpl", "osi"), ("wtfpl", "wtfpl", "wtfpl", "osi"), ("Zlib", "zlib", "zlib", "osi"), ("zlib", "zlib", "zlib", "osi"), # === GPL / AGPL / LGPL (copyleft) === ("gpl-3.0", "gpl-3.0", "gpl-3", "copyleft"), ("GPL-3.0", "gpl-3.0", "gpl-3", "copyleft"), ("gpl-3.0+", "gpl-3.0", "gpl-3", "copyleft"), ( "gpl-3-0", "gpl-3-0", "gpl-3-0", "copyleft", ), # NOTE: hyphen instead of dot; resolver recognises gpl but doesn't normalise ("GNU GPL v3", "gpl-3.0", "gpl-3", "copyleft"), ("GPL v3", "gpl-3.0", "gpl-3", 
"copyleft"), ("gpl-2.0", "gpl-2.0", "gpl-2", "copyleft"), ("GPL v2", "gpl-2.0", "gpl-2", "copyleft"), ("lgpl-3.0", "lgpl-3.0", "lgpl-3", "copyleft"), ("LGPL-3.0", "lgpl-3.0", "lgpl-3", "copyleft"), ("lgpl-2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"), ("LGPL v2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"), ("lgpl v2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"), ("agpl-3.0", "agpl-3.0", "agpl-3", "copyleft"), ("AGPL v3", "agpl-3.0", "agpl-3", "copyleft"), # === Creative Commons === ("CC BY 4.0", "cc-by-4.0", "cc-by", "cc"), ("cc by 4.0", "cc-by-4.0", "cc-by", "cc"), ("cc-by-4.0", "cc-by-4.0", "cc-by", "cc"), ("CC BY 3.0", "cc-by-3.0", "cc-by", "cc"), ("cc by 3.0", "cc-by-3.0", "cc-by", "cc"), ("cc-by-3.0", "cc-by-3.0", "cc-by", "cc"), ("CC BY 2.5", "cc-by-2.5", "cc-by", "cc"), ("CC BY 2.0", "cc-by-2.0", "cc-by", "cc"), ("CC BY 1.0", "cc-by-1.0", "cc-by", "cc"), ("cc by", "cc-by", "cc-by", "cc"), ( "CC-BY", "cc-by", "cc-by", "cc", ), # SPDX form, resolves to cc-by/cc ("CC BY-NC 4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"), ("cc by-nc 4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"), ("cc-by-nc-4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"), ("CC BY-NC 3.0", "cc-by-nc-3.0", "cc-by-nc", "cc"), ("CC BY-NC-SA 4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"), ("cc by-nc-sa 4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"), ("cc-by-nc-sa-4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"), ("CC BY-NC-SA 3.0", "cc-by-nc-sa-3.0", "cc-by-nc-sa", "cc"), ("CC BY-NC-ND 4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"), ("cc by-nc-nd 4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"), ("cc-by-nc-nd-4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"), ("CC BY-NC-ND 3.0", "cc-by-nc-nd-3.0", "cc-by-nc-nd", "cc"), ("cc by-nc-nd 3.0", "cc-by-nc-nd-3.0", "cc-by-nc-nd", "cc"), ("CC BY-ND 4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"), ("cc by-nd 4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"), ("cc-by-nd-4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"), ("CC BY-SA 4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"), ("cc by-sa 4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"), ("cc-by-sa-4.0", 
"cc-by-sa-4.0", "cc-by-sa", "cc"), ("CC BY-SA 3.0", "cc-by-sa-3.0", "cc-by-sa", "cc"), ("cc-by-3.0-igo", "cc-by-3.0-igo", "cc-by", "cc"), ("cc-by-nc-nd-3.0-igo", "cc-by-nc-nd-3.0-igo", "cc-by-nc-nd", "cc"), # CC0 ("CC0 1.0", "cc0-1.0", "cc0", "cc0"), ("cc0 1.0", "cc0-1.0", "cc0", "cc0"), ("cc0-1.0", "cc0-1.0", "cc0", "cc0"), ("CC0", "cc0-1.0", "cc0", "cc0"), ("cc0", "cc0-1.0", "cc0", "cc0"), ("cc-zero", "cc0-1.0", "cc0", "cc0"), ("CC Zero", "cc0-1.0", "cc0", "cc0"), ("CC-Zero", "cc0-1.0", "cc0", "cc0"), ("creative commons zero", "cc0-1.0", "cc0", "cc0"), ("Creative Commons Zero 1.0", "cc0-1.0", "cc0", "cc0"), # CC-PDM ("cc-pdm", "cc-pdm-1.0", "cc-pdm", "public-domain"), ("CC-PDM", "cc-pdm-1.0", "cc-pdm", "public-domain"), ("cc-pdm-1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"), ("CC-PDM 1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"), ("cc-pdm 1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"), ("creative commons public domain", "cc-pdm-1.0", "cc-pdm", "public-domain"), # CC shorthand ("creative commons by", "cc-by", "cc-by", "cc"), ("creative commons by 4.0", "cc-by-4.0", "cc-by", "cc"), ( "creative commons by-sa", "cc-by-sa", "cc-by-sa", "cc", ), # Specifies by-sa, license must be cc-by-sa ( "creative commons by-nc", "cc-by-nc", "cc-by-nc", "cc", ), # Specifies by-nc, license must be cc-by-nc ( "creative commons by-nc-sa", "cc-by-nc-sa", "cc-by-nc-sa", "cc", ), # Specifies by-nc-sa, license must be cc-by-nc-sa ( "creative commons by-nc-nd", "cc-by-nc-nd", "cc-by-nc-nd", "cc", ), # Specifies by-nc-nd, license must be cc-by-nc-nd ( "creative commons by-nd", "cc-by-nd", "cc-by-nd", "cc", ), # Specifies by-nd, license must be cc-by-nd # CC URLs ( "http://creativecommons.org/licenses/by-nc-nd/4.0/", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc", ), ("https://creativecommons.org/licenses/by/4.0/", "cc-by-4.0", "cc-by", "cc"), ("http://creativecommons.org/licenses/by/4.0/", "cc-by-4.0", "cc-by", "cc"), ( "https://creativecommons.org/licenses/by-nc/4.0/", "cc-by-nc-4.0", "cc-by-nc", 
"cc", ), ( "https://creativecommons.org/licenses/by-nc-sa/4.0/", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc", ), ( "https://creativecommons.org/licenses/by-nd/4.0/", "cc-by-nd-4.0", "cc-by-nd", "cc", ), ( "https://creativecommons.org/licenses/by-sa/4.0/", "cc-by-sa-4.0", "cc-by-sa", "cc", ), ( "http://creativecommons.org/licenses/by-nc-nd/3.0/igo/", "cc-by-nc-nd-3.0-igo", "cc-by-nc-nd", "cc", ), ( "https://creativecommons.org/licenses/by/3.0/igo/", "cc-by-3.0-igo", "cc-by", "cc", ), ("https://creativecommons.org/publicdomain/zero/1.0/", "cc0-1.0", "cc0", "cc0"), ("http://creativecommons.org/publicdomain/zero/1.0/", "cc0-1.0", "cc0", "cc0"), # CC prose ("licensed under cc by-nc-nd 4.0 terms", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"), ( "content is licensed under creative commons by-nc-sa", "cc-by-nc-sa", "cc-by-nc-sa", # Contains by-nc-sa, license must be cc-by-nc-sa "cc", ), ("this content is under creative commons by license", "cc-by", "cc-by", "cc"), # Open Data ("ODbL", "odbl", "odbl", "open-data"), ("odbl", "odbl", "odbl", "open-data"), ("Open Database License", "odbl", "odbl", "open-data"), ("ODC-BY", "odc-by", "odc-by", "open-data"), ("odc-by", "odc-by", "odc-by", "open-data"), ("PDDL", "pddl", "pddl", "open-data"), ("pddl", "pddl", "pddl", "open-data"), ( "Open Data Commons Public Domain Dedication", "public-domain", "public-domain", "public-domain", ), # Publisher ("elsevier-oa", "elsevier-oa", "elsevier-oa", "publisher-oa"), ( "Elsevier OA", "elsevier-oa", "elsevier-oa", "publisher-oa", ), # "Elsevier OA" unambiguously identifies Elsevier OA license ("elsevier tdm", "elsevier-tdm", "elsevier-tdm", "publisher-tdm"), ("Elsevier TDM", "elsevier-tdm", "elsevier-tdm", "publisher-tdm"), ("Elsevier User License", "elsevier-oa", "elsevier-oa", "publisher-oa"), ( "https://www.elsevier.com/open-access/userlicense/1.0/", "elsevier-oa", "elsevier-oa", "publisher-oa", ), ("wiley-tdm", "wiley-tdm", "wiley-tdm", "publisher-tdm"), ("Wiley TDM", "wiley-tdm", "wiley-tdm", 
"publisher-tdm"), ("wiley vor", "wiley-vor", "wiley-vor", "publisher-proprietary"), ("springer-tdm", "springer-tdm", "springer-tdm", "publisher-tdm"), ( "Springer Nature TDM", "springernature-tdm", "springernature-tdm", "publisher-tdm", ), ("acs-authorchoice", "acs-authorchoice", "acs-authorchoice", "publisher-oa"), ("ACS AuthorChoice", "acs-authorchoice", "acs-authorchoice", "publisher-oa"), ( "acs-authorchoice-ccby", "acs-authorchoice-ccby", "acs-authorchoice-ccby", "publisher-oa", ), ( "acs authorchoice cc by", "acs-authorchoice-ccby", "acs-authorchoice-ccby", "publisher-oa", ), ("aps-default", "aps-default", "aps-default", "publisher-proprietary"), ("APS Default", "aps-default", "aps-default", "publisher-proprietary"), ("iop-tdm", "iop-tdm", "iop-tdm", "publisher-tdm"), ("iop copyright", "iop-copyright", "iop-copyright", "publisher-proprietary"), ("bmj copyright", "bmj-copyright", "bmj-copyright", "publisher-proprietary"), ("rsc terms", "rsc-terms", "rsc-terms", "publisher-proprietary"), ("cup terms", "cup-terms", "cup-terms", "publisher-proprietary"), ("degruyter terms", "degruyter-terms", "degruyter-terms", "publisher-proprietary"), ("tandf terms", "tandf-terms", "tandf-terms", "publisher-proprietary"), ( "sage permissions", "sage-permissions", "sage-permissions", "publisher-proprietary", ), ("wiley terms", "wiley-terms", "wiley-terms", "publisher-proprietary"), ("wiley am", "wiley-am", "wiley-am", "publisher-proprietary"), ("pnas licenses", "pnas-licenses", "pnas-licenses", "publisher-proprietary"), ( "aaas author reuse", "aaas-author-reuse", "aaas-author-reuse", "publisher-proprietary", ), ("aip rights", "aip-rights", "aip-rights", "publisher-proprietary"), ("jama cc by", "jama-cc-by", "jama-cc-by", "publisher-oa"), ("thieme nlm", "thieme-nlm", "thieme-nlm", "publisher-oa"), ("oup chorus", "oup-chorus", "oup-chorus", "publisher-oa"), ("implied oa", "implied-oa", "implied-oa", "publisher-oa"), ("implied open access", "implied-oa", "implied-oa", 
"publisher-oa"), ("unspecified oa", "unspecified-oa", "unspecified-oa", "other-oa"), ( "publisher specific oa", "publisher-specific-oa", "publisher-specific-oa", "publisher-oa", ), ("author manuscript", "author-manuscript", "author-manuscript", "publisher-oa"), ("open access", "other-oa", "other-oa", "other-oa"), ("other-oa", "other-oa", "other-oa", "other-oa"), ( "all rights reserved", "all-rights-reserved", "all-rights-reserved", "publisher-proprietary", ), ("no reuse", "no-reuse", "no-reuse", "publisher-proprietary"), # Publisher prose ( "this article is licensed under elsevier tdm agreement", "elsevier-tdm", "elsevier-tdm", "publisher-tdm", ), ( "journal article under elsevier user license for open access", "elsevier-oa", "elsevier-oa", "publisher-oa", ), ( "acs authorchoice option was selected by the authors", "acs-authorchoice", "acs-authorchoice", "publisher-oa", ), ( "springer tdm policy applies to this content", "springer-tdm", "springer-tdm", "publisher-tdm", ), # Unknown ( "Totally Fake License XYZ999", "totally fake license xyz999", "totally fake license xyz999", "unknown", ), # Public domain ("public domain", "public-domain", "public-domain", "public-domain"), ("public-domain", "public-domain", "public-domain", "public-domain"), ("pd", "public-domain", "public-domain", "public-domain"), ] @pytest.mark.parametrize( "raw,expected_key,expected_license,expected_family", LICENSE_MATRIX ) def test_license_matrix(raw, expected_key, expected_license, expected_family): v = normalise_license(raw) assert v.key == expected_key, f"input: {raw!r} key: {v.key!r} != {expected_key!r}" assert v.license.key == expected_license, ( f"input: {raw!r} license: {v.license.key!r} != {expected_license!r}" ) assert v.family.key == expected_family, ( f"input: {raw!r} family: {v.family.key!r} != {expected_family!r}" ) def test_strict_mode_unknown_raises(): with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)): normalise_license("xyzzy unknown license 123", 
strict=True)


def test_strict_mode_known_does_not_raise():
    v = normalise_license("mit", strict=True)
    assert v.key == "mit"


def test_empty_string_returns_unknown():
    v = normalise_license("")
    assert v.key == "unknown"
    assert v.family.key == "unknown"


def test_whitespace_only_returns_unknown():
    v = normalise_license(" \n\t ")
    assert v.key == "unknown"


def test_batch_normalise_preserves_order():
    inputs = ["MIT", "Apache-2.0", "CC BY 4.0", "unknown garbage"]
    results = normalise_licenses(inputs)
    assert [r.key for r in results] == [
        "mit",
        "apache-2.0",
        "cc-by-4.0",
        "unknown garbage",
    ]


def test_normalise_mit():
    v = normalise_license("MIT")
    assert isinstance(v, LicenseVersion)
    assert v.key == "mit"
    assert str(v) == "mit"
    assert str(v.license) == "mit"


def test_normalise_cc():
    v = normalise_license("CC BY 4.0")
    assert v.key == "cc-by-4.0"
    assert str(v.license) == "cc-by"
    assert str(v.family) == "cc"


def test_batch():
    results = normalise_licenses(["MIT", "Apache-2.0"])
    assert len(results) == 2
    assert results[0].key == "mit"
    assert results[1].key == "apache-2.0"


def test_strict_mode_raises():
    with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
        normalise_license("Totally Fake License XYZ999", strict=True)


def test_strict_batch_raises():
    with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
        normalise_licenses(["MIT", "Fake License XYZ999"], strict=True)


def test_empty_input():
    v = normalise_license("")
    assert v.key == "unknown"
    v = normalise_license(" ")
    assert v.key == "unknown"


def test_real_world_license_strings():
    """Test against real-world license strings collected from the wild."""
    cases = [
        ("http://creativecommons.org/licenses/by-nc-nd/4.0/", "cc-by-nc-nd-4.0"),
        ("http://creativecommons.org/licenses/by/4.0/", "cc-by-4.0"),
        ("http://creativecommons.org/licenses/by-nc/4.0/", "cc-by-nc-4.0"),
        (
            "http://www.elsevier.com/open-access/userlicense/1.0/",
            "elsevier-oa",
        ),
        (
            "http://creativecommons.org/licenses/by-nc-nd/3.0/igo/",
                "cc-by-nc-nd-3.0-igo",
            ),
            ("CC BY-NC-ND 4.0", "cc-by-nc-nd-4.0"),
            (
                "http://creativecommons.org/licenses/by/3.0/igo/",
                "cc-by-3.0-igo",
            ),
        ]
        for raw, expected_key in cases:
            v = normalise_license(raw)
            assert v.key == expected_key, (
                f"input: {raw!r} -> got {v.key!r}, want {expected_key!r}"
            )

src/licence_normaliser/tests/test_models.py
===========================================

src/licence_normaliser/tests/test_models.py

"""Unit tests for _models.py."""

import pytest

from licence_normaliser._models import LicenseFamily, LicenseName, LicenseVersion

__author__ = "Artur Barseghyan "
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"


def _cc_fam():
    return LicenseFamily(key="cc")


def _osi_fam():
    return LicenseFamily(key="osi")


def _cc_by_name():
    return LicenseName(key="cc-by", family=_cc_fam())


def _mit_version():
    return LicenseVersion(
        key="mit",
        url="https://opensource.org/licenses/MIT",
        license=LicenseName(key="mit", family=_osi_fam()),
    )


class TestLicenseFamily:
    def test_str(self):
        assert str(LicenseFamily(key="cc")) == "cc"

    def test_repr(self):
        assert repr(LicenseFamily(key="osi")) == "LicenseFamily('osi')"

    def test_eq_same_type(self):
        assert LicenseFamily(key="cc") == LicenseFamily(key="cc")

    def test_eq_str(self):
        assert LicenseFamily(key="cc") == "cc"

    def test_neq(self):
        assert LicenseFamily(key="cc") != LicenseFamily(key="osi")

    def test_hash_usable_in_set(self):
        s = {LicenseFamily(key="cc"), LicenseFamily(key="cc"), LicenseFamily(key="osi")}
        assert len(s) == 2

    def test_frozen_prevents_mutation(self):
        fam = LicenseFamily(key="cc")
        with pytest.raises((AttributeError, TypeError)):
            fam.key = "other"  # type: ignore


class TestLicenseName:
    def test_str(self):
        assert str(_cc_by_name()) == "cc-by"

    def test_frozen_prevents_mutation(self):
        name = _cc_by_name()
        with pytest.raises((AttributeError, TypeError)):
            name.key = "other"  # type: ignore

    def test_family_reference(self):
        assert _cc_by_name().family.key == "cc"


class TestLicenseVersion:
    def test_str(self):
        assert str(_mit_version()) == "mit"

    def test_family_shortcut(self):
        assert _mit_version().family.key == "osi"

    def test_frozen_prevents_mutation(self):
        v = _mit_version()
        with pytest.raises((AttributeError, TypeError)):
            v.key = "other"  # type: ignore

    def test_url_stored(self):
        assert _mit_version().url == "https://opensource.org/licenses/MIT"

    def test_url_none(self):
        v = LicenseVersion(
            key="unknown",
            url=None,
            license=LicenseName(key="unknown", family=LicenseFamily(key="unknown")),
        )
        assert v.url is None

src/licence_normaliser/tests/test_prose.py
==========================================

src/licence_normaliser/tests/test_prose.py

"""Tests for prose pattern matching via ProseParser."""

from licence_normaliser import normalise_license


class TestProsePatternMatching:
    def test_cc_by_nc_nd_4_0_prose(self):
        v = normalise_license("this work is licensed under cc by-nc-nd 4.0 terms")
        assert v.key == "cc-by-nc-nd-4.0"
        assert v.family.key == "cc"

    def test_cc_by_nc_nd_3_0_prose(self):
        v = normalise_license("license: cc by-nc-nd 3.0")
        assert v.key == "cc-by-nc-nd-3.0"
        assert v.family.key == "cc"

    def test_cc_by_nc_sa_creative_commons_prose(self):
        v = normalise_license("content licensed under creative commons by-nc-sa")
        assert v.key == "cc-by-nc-sa"
        assert v.family.key == "cc"

    def test_attribution_prose(self):
        v = normalise_license(
            "this content is made available under creative commons by license"
        )
        assert v.key == "cc-by"
        assert v.family.key == "cc"

    def test_attribution_noncommercial_prose(self):
        v = normalise_license(
            "this article is licensed under attribution noncommercial terms"
        )
        assert v.key == "cc-by-nc"
        assert v.family.key == "cc"

    def test_attribution_sharealike_prose(self):
        v = normalise_license("licensed under attribution share alike conditions")
        assert v.key == "cc-by-sa"
        assert v.family.key == "cc"

    def test_elsevier_tdm_prose(self):
        v = normalise_license(
            "this journal participates in text and data mining as "
            "permitted by the elsevier tdm agreement"
        )
        assert v.key == "elsevier-tdm"
        assert v.family.key == "publisher-tdm"

    def test_elsevier_user_license_prose(self):
        v = normalise_license(
            "elsevier user license applies to this open access article"
        )
        assert v.key == "elsevier-oa"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_prose(self):
        v = normalise_license("acs authorchoice option was selected by the authors")
        assert v.key == "acs-authorchoice"
        assert v.family.key == "publisher-oa"

    def test_all_rights_reserved_prose(self):
        v = normalise_license("all rights reserved except as permitted by law")
        assert v.key == "all-rights-reserved"
        assert v.family.key == "publisher-proprietary"

    def test_short_string_via_registry(self):
        v = normalise_license("cc by-nc-nd")
        assert v.key == "cc-by-nc-nd"
        assert v.family.key == "cc"

    def test_open_access_prose_matched(self):
        v = normalise_license("open access article available now")
        assert v.key == "other-oa"
        assert v.family.key == "other-oa"

src/licence_normaliser/tests/test_publisher.py
==============================================

src/licence_normaliser/tests/test_publisher.py

"""Tests for PublisherParser - publisher URLs and shorthand aliases."""

from licence_normaliser import normalise_license


class TestPublisherUrls:
    def test_elsevier_oa_url(self):
        v = normalise_license("https://www.elsevier.com/open-access/userlicense/1.0/")
        assert v.key == "elsevier-oa"
        assert v.family.key == "publisher-oa"

    def test_elsevier_oa_url_http(self):
        v = normalise_license("http://www.elsevier.com/open-access/userlicense/1.0/")
        assert v.key == "elsevier-oa"
        assert v.family.key == "publisher-oa"

    def test_elsevier_tdm_url(self):
        v = normalise_license("https://www.elsevier.com/tdm/userlicense/1.0/")
        assert v.key == "elsevier-tdm"
        assert v.family.key == "publisher-tdm"

    def test_wiley_tdm_url(self):
        v = normalise_license("http://doi.wiley.com/10.1002/tdm_license_1")
        assert v.key == "wiley-tdm"
        assert v.family.key == "publisher-tdm"

    def test_wiley_terms_url(self):
        v = normalise_license("https://onlinelibrary.wiley.com/terms-and-conditions")
        assert v.key == "wiley-terms"
        assert v.family.key == "publisher-proprietary"

    def test_springer_tdm_url(self):
        v = normalise_license("https://www.springer.com/tdm")
        assert v.key == "springer-tdm"
        assert v.family.key == "publisher-tdm"

    def test_springernature_tdm_url(self):
        v = normalise_license(
            "https://www.springernature.com/gp/researchers/text-and-data-mining"
        )
        assert v.key == "springernature-tdm"
        assert v.family.key == "publisher-tdm"

    def test_acs_authorchoice_ccby_url(self):
        v = normalise_license(
            "https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html"
        )
        assert v.key == "acs-authorchoice-ccby"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_url(self):
        v = normalise_license(
            "https://pubs.acs.org/page/policy/authorchoice_termsofuse.html"
        )
        assert v.key == "acs-authorchoice"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_nih_url(self):
        v = normalise_license(
            "https://pubs.acs.org/page/policy/"
            "acs_authorchoice_with_nih_addendum_termsofuse.html"
        )
        assert v.key == "acs-authorchoice-nih"
        assert v.family.key == "publisher-oa"

    def test_rsc_terms_url(self):
        v = normalise_license(
            "https://www.rsc.org/journals-books-databases/"
            "journal-authors-reviewers/licences-copyright-permissions/"
        )
        assert v.key == "rsc-terms"
        assert v.family.key == "publisher-proprietary"

    def test_iop_tdm_url(self):
        v = normalise_license(
            "https://iopscience.iop.org/info/page/text-and-data-mining"
        )
        assert v.key == "iop-tdm"
        assert v.family.key == "publisher-tdm"

    def test_bmj_copyright_url(self):
        v = normalise_license(
            "https://www.bmj.com/company/legal-stuff/copyright-notice/"
        )
        assert v.key == "bmj-copyright"
        assert v.family.key == "publisher-proprietary"

    def test_aaas_author_reuse_url(self):
        v = normalise_license(
            "https://www.science.org/content/page/science-licenses-journal-article-reuse"
        )
        assert v.key == "aaas-author-reuse"
        assert v.family.key == "publisher-proprietary"

    def test_aps_default_url(self):
        v = normalise_license("https://link.aps.org/licenses/aps-default-license")
        assert v.key == "aps-default"
        assert v.family.key == "publisher-proprietary"

    def test_aps_tdm_url(self):
        v = normalise_license(
            "https://link.aps.org/licenses/aps-default-text-mining-license"
        )
        assert v.key == "aps-tdm"
        assert v.family.key == "publisher-tdm"

    def test_cup_terms_url(self):
        v = normalise_license("https://www.cambridge.org/core/terms")
        assert v.key == "cup-terms"
        assert v.family.key == "publisher-proprietary"

    def test_aip_rights_url(self):
        v = normalise_license(
            "https://publishing.aip.org/authors/rights-and-permissions"
        )
        assert v.key == "aip-rights"
        assert v.family.key == "publisher-proprietary"

    def test_jama_cc_by_url(self):
        v = normalise_license("https://jamanetwork.com/pages/cc-by-license-permissions")
        assert v.key == "jama-cc-by"
        assert v.family.key == "publisher-oa"

    def test_oup_chorus_url(self):
        v = normalise_license(
            "https://academic.oup.com/journals/pages/open_access/"
            "funder_policies/chorus/standard_publication_model"
        )
        assert v.key == "oup-chorus"
        assert v.family.key == "publisher-oa"

    def test_oup_terms_url(self):
        v = normalise_license(
            "https://academic.oup.com/pages/standard-publication-reuse-rights"
        )
        assert v.key == "oup-terms"
        assert v.family.key == "publisher-proprietary"

    def test_sage_permissions_url(self):
        v = normalise_license("https://us.sagepub.com/en-us/nam/journals-permissions")
        assert v.key == "sage-permissions"
        assert v.family.key == "publisher-proprietary"

    def test_tandf_terms_url(self):
        v = normalise_license("https://www.tandfonline.com/action/showCopyRight")
        assert v.key == "tandf-terms"
        assert v.family.key == "publisher-proprietary"

    def test_gnu_gpl_url(self):
        v = normalise_license("https://www.gnu.org/licenses/gpl-3.0.html")
        assert v.key == "gpl-3.0"
        assert v.family.key == "copyleft"


class TestPublisherShorthand:
    def test_elsevier_user_license(self):
        v = normalise_license("elsevier user license")
        assert v.key == "elsevier-oa"
        assert v.family.key == "publisher-oa"

    def test_elsevier_tdm_shorthand(self):
        v = normalise_license("elsevier tdm")
        assert v.key == "elsevier-tdm"
        assert v.family.key == "publisher-tdm"

    def test_wiley_tdm_shorthand(self):
        v = normalise_license("wiley tdm license")
        assert v.key == "wiley-tdm"
        assert v.family.key == "publisher-tdm"

    def test_wiley_vor(self):
        v = normalise_license("wiley vor")
        assert v.key == "wiley-vor"
        assert v.family.key == "publisher-proprietary"

    def test_wiley_am(self):
        v = normalise_license("wiley am")
        assert v.key == "wiley-am"
        assert v.family.key == "publisher-proprietary"

    def test_springer_tdm_shorthand(self):
        v = normalise_license("springer tdm")
        assert v.key == "springer-tdm"
        assert v.family.key == "publisher-tdm"

    def test_springer_nature_tdm_shorthand(self):
        v = normalise_license("springer nature tdm")
        assert v.key == "springernature-tdm"
        assert v.family.key == "publisher-tdm"

    def test_acs_authorchoice_shorthand(self):
        v = normalise_license("acs authorchoice")
        assert v.key == "acs-authorchoice"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_ccby_shorthand(self):
        v = normalise_license("acs authorchoice cc by")
        assert v.key == "acs-authorchoice-ccby"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_nih_shorthand(self):
        v = normalise_license("acs authorchoice nih")
        assert v.key == "acs-authorchoice-nih"
        assert v.family.key == "publisher-oa"

    def test_rsc_terms_shorthand(self):
        v = normalise_license("rsc terms")
        assert v.key == "rsc-terms"
        assert v.family.key == "publisher-proprietary"

    def test_iop_tdm_shorthand(self):
        v = normalise_license("iop tdm")
        assert v.key == "iop-tdm"
        assert v.family.key == "publisher-tdm"

    def test_iop_copyright_shorthand(self):
        v = normalise_license("iop copyright")
        assert v.key == "iop-copyright"
        assert v.family.key == "publisher-proprietary"

    def test_bmj_copyright_shorthand(self):
        v = normalise_license("bmj copyright")
        assert v.key == "bmj-copyright"
        assert v.family.key == "publisher-proprietary"

    def test_aaas_author_reuse_shorthand(self):
        v = normalise_license("aaas author reuse")
        assert v.key == "aaas-author-reuse"
        assert v.family.key == "publisher-proprietary"

    def test_pnas_licenses_shorthand(self):
        v = normalise_license("pnas licenses")
        assert v.key == "pnas-licenses"
        assert v.family.key == "publisher-proprietary"

    def test_aps_default_shorthand(self):
        v = normalise_license("aps default")
        assert v.key == "aps-default"
        assert v.family.key == "publisher-proprietary"

    def test_aps_tdm_shorthand(self):
        v = normalise_license("aps tdm")
        assert v.key == "aps-tdm"
        assert v.family.key == "publisher-tdm"

    def test_cup_terms_shorthand(self):
        v = normalise_license("cup terms")
        assert v.key == "cup-terms"
        assert v.family.key == "publisher-proprietary"

    def test_aip_rights_shorthand(self):
        v = normalise_license("aip rights")
        assert v.key == "aip-rights"
        assert v.family.key == "publisher-proprietary"

    def test_jama_cc_by_shorthand(self):
        v = normalise_license("jama cc by")
        assert v.key == "jama-cc-by"
        assert v.family.key == "publisher-oa"

    def test_degruyter_terms_shorthand(self):
        v = normalise_license("degruyter terms")
        assert v.key == "degruyter-terms"
        assert v.family.key == "publisher-proprietary"

    def test_oup_chorus_shorthand(self):
        v = normalise_license("oup chorus")
        assert v.key == "oup-chorus"
        assert v.family.key == "publisher-oa"

    def test_oup_terms_shorthand(self):
        v = normalise_license("oup terms")
        assert v.key == "oup-terms"
        assert v.family.key == "publisher-proprietary"

    def test_sage_permissions_shorthand(self):
        v = normalise_license("sage permissions")
        assert v.key == "sage-permissions"
        assert v.family.key == "publisher-proprietary"

    def test_tandf_terms_shorthand(self):
        v = normalise_license("tandf terms")
        assert v.key == "tandf-terms"
        assert v.family.key == "publisher-proprietary"

    def test_thieme_nlm_shorthand(self):
        v = normalise_license("thieme nlm")
        assert v.key == "thieme-nlm"
        assert v.family.key == "publisher-oa"


class TestPublisherDirectKeys:
    def test_elsevier_tdm_key(self):
        v = normalise_license("elsevier-tdm")
        assert v.key == "elsevier-tdm"
        assert v.family.key == "publisher-tdm"

    def test_elsevier_oa_key(self):
        v = normalise_license("elsevier-oa")
        assert v.key == "elsevier-oa"
        assert v.family.key == "publisher-oa"

    def test_wiley_tdm_key(self):
        v = normalise_license("wiley-tdm")
        assert v.key == "wiley-tdm"
        assert v.family.key == "publisher-tdm"

    def test_acs_authorchoice_key(self):
        v = normalise_license("acs-authorchoice")
        assert v.key == "acs-authorchoice"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_ccby_key(self):
        v = normalise_license("acs-authorchoice-ccby")
        assert v.key == "acs-authorchoice-ccby"
        assert v.family.key == "publisher-oa"

    def test_acs_authorchoice_nih_key(self):
        v = normalise_license("acs-authorchoice-nih")
        assert v.key == "acs-authorchoice-nih"
        assert v.family.key == "publisher-oa"

    def test_iop_tdm_key(self):
        v = normalise_license("iop-tdm")
        assert v.key == "iop-tdm"
        assert v.family.key == "publisher-tdm"

    def test_aps_tdm_key(self):
        v = normalise_license("aps-tdm")
        assert v.key == "aps-tdm"
        assert v.family.key == "publisher-tdm"

    def test_oup_chorus_key(self):
        v = normalise_license("oup-chorus")
        assert v.key == "oup-chorus"
        assert v.family.key == "publisher-oa"

    def test_jama_cc_by_key(self):
        v = normalise_license("jama-cc-by")
        assert v.key == "jama-cc-by"
        assert v.family.key == "publisher-oa"

    def test_thieme_nlm_key(self):
        v = normalise_license("thieme-nlm")
        assert v.key == "thieme-nlm"
        assert v.family.key == "publisher-oa"

    def test_implied_oa_key(self):
        v = normalise_license("implied-oa")
        assert v.key == "implied-oa"
        assert v.family.key == "publisher-oa"

    def test_unspecified_oa_key(self):
        v = normalise_license("unspecified-oa")
        assert v.key == "unspecified-oa"
        assert v.family.key == "other-oa"

    def test_author_manuscript_key(self):
        v = normalise_license("author-manuscript")
        assert v.key == "author-manuscript"
        assert v.family.key == "publisher-oa"

    def test_all_rights_reserved_key(self):
        v = normalise_license("all-rights-reserved")
        assert v.key == "all-rights-reserved"
        assert v.family.key == "publisher-proprietary"

    def test_no_reuse_key(self):
        v = normalise_license("no-reuse")
        assert v.key == "no-reuse"
        assert v.family.key == "publisher-proprietary"

    def test_other_oa_key(self):
        v = normalise_license("other-oa")
        assert v.key == "other-oa"
        assert v.family.key == "other-oa"

    def test_public_domain_key(self):
        v = normalise_license("public-domain")
        assert v.key == "public-domain"
        assert v.family.key == "public-domain"

    def test_open_access_key(self):
        v = normalise_license("open-access")
        assert v.key == "other-oa"
        assert v.family.key == "other-oa"


class TestPublisherCatchAll:
    def test_implied_oa_shorthand(self):
        v = normalise_license("implied oa")
        assert v.key == "implied-oa"
        assert v.family.key == "publisher-oa"

    def test_unspecified_oa_shorthand(self):
        v = normalise_license("unspecified oa")
        assert v.key == "unspecified-oa"
        assert v.family.key == "other-oa"

    def test_open_access_shorthand(self):
        v = normalise_license("open access")
        assert v.key == "other-oa"
        assert v.family.key == "other-oa"

    def test_author_manuscript_shorthand(self):
        v = normalise_license("author manuscript")
        assert v.key == "author-manuscript"
        assert v.family.key == "publisher-oa"

    def test_all_rights_reserved_shorthand(self):
        v = normalise_license("all rights reserved")
        assert v.key == "all-rights-reserved"
        assert v.family.key == "publisher-proprietary"

    def test_no_reuse_shorthand(self):
        v = normalise_license("no reuse")
        assert v.key == "no-reuse"
        assert v.family.key == "publisher-proprietary"


class TestCCPublicDomain:
    def test_cc_pdm_bare_key(self):
        v = normalise_license("cc-pdm")
        assert v.key == "cc-pdm-1.0"
        assert v.family.key == "public-domain"

    def test_cc_pdm_versioned_key(self):
        v = normalise_license("cc-pdm-1.0")
        assert v.key == "cc-pdm-1.0"
        assert v.family.key == "public-domain"

    def test_cc0_bare_key(self):
        v = normalise_license("cc0")
        assert v.key == "cc0-1.0"
        assert v.family.key == "cc0"

    def test_cc0_versioned_key(self):
        v = normalise_license("cc0-1.0")
        assert v.key == "cc0-1.0"
        assert v.family.key == "cc0"

    def test_cc_zero_shorthand(self):
        v = normalise_license("cc-zero")
        assert v.key == "cc0-1.0"
        assert v.family.key == "cc0"

    def test_public_domain_fallback(self):
        v = normalise_license("public-domain")
        assert v.key == "public-domain"
        assert v.family.key == "public-domain"

    def test_creative_commons_zero(self):
        v = normalise_license("creative commons zero")
        assert v.key == "cc0-1.0"
        assert v.family.key == "cc0"

    def test_creative_commons_public_domain(self):
        v = normalise_license("creative commons public domain")
        assert v.key == "cc-pdm-1.0"
        assert v.family.key == "public-domain"