Project source-tree
Below is the layout of the project (to 10 levels), followed by the contents of each key file.
Project directory layout
licence-normaliser/
├── scripts
│   ├── __init__.py
│   ├── check_missing_aliases.py
│   ├── compare_datasets.py
│   ├── README.rst
│   └── test_name_inference.py
├── src
│   └── licence_normaliser
│       ├── cli
│       │   ├── __init__.py
│       │   └── _main.py
│       ├── data
│       │   ├── aliases
│       │   │   └── aliases.json
│       │   ├── prose
│       │   │   └── prose_patterns.json
│       │   ├── publishers
│       │   │   └── publishers.json
│       │   ├── urls
│       │   │   └── url_map.json
│       │   └── README.rst
│       ├── parsers
│       │   ├── __init__.py
│       │   ├── alias.py
│       │   ├── creativecommons.py
│       │   ├── opendefinition.py
│       │   ├── osi.py
│       │   ├── prose.py
│       │   ├── publisher.py
│       │   ├── scancode_licensedb.py
│       │   └── spdx.py
│       ├── tests
│       │   ├── __init__.py
│       │   ├── conftest.py
│       │   ├── test_aliases.py
│       │   ├── test_cache.py
│       │   ├── test_cli.py
│       │   ├── test_core.py
│       │   ├── test_exceptions.py
│       │   ├── test_integration.py
│       │   ├── test_models.py
│       │   ├── test_prose.py
│       │   └── test_publisher.py
│       ├── __init__.py
│       ├── _cache.py
│       ├── _core.py
│       ├── _models.py
│       ├── _normaliser.py
│       ├── _trace.py
│       ├── defaults.py
│       ├── exceptions.py
│       ├── plugins.py
│       └── py.typed
├── AGENTS.md
├── conftest.py
├── CONTRIBUTING.rst
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── pyproject.toml
├── README.rst
└── tox.ini
README.rst
==================
licence-normaliser
==================
.. image:: https://raw.githubusercontent.com/barseghyanartur/licence-normaliser/main/docs/_static/licence_normaliser_logo.webp
:alt: licence-normaliser logo
:align: center
Comprehensive license normalisation with a three-level hierarchy.
.. image:: https://img.shields.io/pypi/v/licence-normaliser.svg
:target: https://pypi.python.org/pypi/licence-normaliser
:alt: PyPI Version
.. image:: https://img.shields.io/pypi/pyversions/licence-normaliser.svg
:target: https://pypi.python.org/pypi/licence-normaliser/
:alt: Supported Python versions
.. image:: https://github.com/barseghyanartur/licence-normaliser/actions/workflows/test.yml/badge.svg?branch=main
:target: https://github.com/barseghyanartur/licence-normaliser/actions
:alt: Build Status
.. image:: https://readthedocs.org/projects/licence-normaliser/badge/?version=latest
:target: http://licence-normaliser.readthedocs.io
:alt: Documentation Status
.. image:: https://img.shields.io/badge/docs-llms.txt-blue
:target: https://licence-normaliser.readthedocs.io/en/latest/llms.txt
:alt: llms.txt - documentation for LLMs
.. image:: https://img.shields.io/badge/license-MIT-blue.svg
:target: https://github.com/barseghyanartur/licence-normaliser/#License
:alt: MIT
.. image:: https://coveralls.io/repos/github/barseghyanartur/licence-normaliser/badge.svg?branch=main&service=github
:target: https://coveralls.io/github/barseghyanartur/licence-normaliser?branch=main
:alt: Coverage
``licence-normaliser`` is a comprehensive license normalisation library that
maps any license representation (SPDX tokens, URLs, prose descriptions) to a
canonical three-level hierarchy.
Features
========
- **Three-level hierarchy** - LicenseFamily → LicenseName → LicenseVersion.
- **Wide format support** - SPDX tokens, URLs, prose descriptions.
- **Creative Commons support** - Full CC family with versions and IGO variants.
- **Publisher-specific licenses** - Springer, Nature, Elsevier, Wiley, ACS,
and more.
- **File-driven data** - Add aliases, URLs, and patterns by editing JSON files.
No Python code changes required for new synonyms.
- **Pluggable parsers** - Drop in a new parser class to ingest
any external license registry. Parsers implement plugin interfaces
(``RegistryPlugin``, ``URLPlugin``, etc.).
- **Strict mode** - Raise ``LicenseNotFoundError`` instead of silently
returning ``"unknown"``.
- **Caching** - LRU caching for performance.
- **CLI** - Command-line interface with ``--strict`` and ``--explain`` support.
Hierarchy
=========
The library uses a three-level hierarchy:
1. **LicenseFamily** - broad bucket: ``"cc"``, ``"osi"``, ``"copyleft"``,
``"publisher-tdm"``, ...
2. **LicenseName** - version-free: ``"cc-by"``, ``"cc-by-nc-nd"``, ``"mit"``,
``"wiley-tdm"``
3. **LicenseVersion** - fully resolved: ``"cc-by-3.0"``, ``"cc-by-nc-nd-4.0"``
Installation
============
With ``uv``:
.. code-block:: sh
uv pip install licence-normaliser
Or with ``pip``:
.. code-block:: sh
pip install licence-normaliser
Quick start
===========
.. code-block:: python
:name: test_quick_start
from licence_normaliser import normalise_license
v = normalise_license("CC BY-NC-ND 4.0")
str(v) # "cc-by-nc-nd-4.0" ← LicenseVersion
str(v.license) # "cc-by-nc-nd" ← LicenseName
str(v.license.family) # "cc" ← LicenseFamily
Strict mode
===========
By default, unresolvable inputs return an ``"unknown"`` result. Pass
``strict=True`` to raise ``LicenseNotFoundError`` instead:
.. code-block:: python
:name: test_strict_mode
from licence_normaliser import normalise_license
from licence_normaliser.exceptions import LicenseNotFoundError
# Silent fallback (default)
v = normalise_license("some-unknown-string")
v.family.key # "unknown"
# Strict: raises on unresolvable input
try:
    v = normalise_license("some-unknown-string", strict=True)
except LicenseNotFoundError as exc:
    print(exc.raw)  # original input
    print(exc.cleaned)  # cleaned form that failed lookup
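The switch between the silent fallback and strict mode can be pictured with
a minimal sketch. The ``NotFoundError`` class, the ``KNOWN`` set, and the
``lookup`` helper are hypothetical stand-ins rather than the library's
implementation; only the ``raw`` and ``cleaned`` attribute names are taken
from the example above.

.. code-block:: python

    class NotFoundError(LookupError):
        """Hypothetical error carrying both forms of the failed input."""

        def __init__(self, raw: str, cleaned: str) -> None:
            super().__init__(f"License not found: {raw!r}")
            self.raw = raw
            self.cleaned = cleaned


    KNOWN = {"mit", "apache-2.0"}  # hypothetical registry


    def lookup(raw: str, strict: bool = False) -> str:
        """Return the cleaned key, or fall back / raise when unknown."""
        cleaned = raw.strip().lower()
        if cleaned in KNOWN:
            return cleaned
        if strict:
            raise NotFoundError(raw, cleaned)
        return "unknown"  # silent fallback (simplified for this sketch)


    print(lookup(" MIT "))
    print(lookup("Some-Unknown-String"))
    try:
        lookup("Some-Unknown-String", strict=True)
    except NotFoundError as exc:
        print(exc.raw, exc.cleaned)

Keeping both the raw and the cleaned form on the exception lets a caller log
exactly what was submitted and what the lookup actually searched for.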
Trace / Explain
===============
Set ``ENABLE_LICENCE_NORMALISER_TRACE=1`` or pass ``trace=True`` to get
resolution traces showing how the license was matched:
.. code-block:: python
:name: test_trace
from licence_normaliser import normalise_license
# Via function
v = normalise_license("cc by-nc-nd 3.0 igo", trace=True)
print(v.explain())
# Via class
from licence_normaliser import LicenseNormaliser
ln = LicenseNormaliser(trace=True)
v = ln.normalise_license("MIT")
print(v.explain())
Output shows the resolution pipeline (alias → registry → url → prose →
fallback) and which source file + line matched:
.. code-block:: text
Input: 'cc by-nc-nd 3.0 igo' → 'cc by-nc-nd 3.0 igo'
[✓] alias: 'cc by-nc-nd 3.0 igo' → 'cc-by-nc-nd-3.0-igo' (line 139 in aliases.json)
Result:
version_key: 'cc-by-nc-nd-3.0-igo'
name_key: 'cc-by-nc-nd'
family_key: 'cc'
The trace can also be accessed via ``v._trace`` for programmatic use.
Batch normalisation
===================
.. code-block:: python
:name: test_batch_normalisation
from licence_normaliser import normalise_licenses
results = normalise_licenses(["MIT", "Apache-2.0", "CC BY 4.0"])
for r in results:
    print(r.key)
# Strict batch - raises on first unresolvable
results = normalise_licenses(["MIT", "Apache-2.0"], strict=True)
Custom plugins
==============
The ``LicenseNormaliser`` class lets you inject custom plugin classes for
specialised use cases:
.. code-block:: python
:name: test_custom_plugins
from licence_normaliser import LicenseNormaliser
from licence_normaliser.parsers.alias import AliasParser
from licence_normaliser.parsers.spdx import SPDXParser
# Use only SPDX + Alias plugins (no CC, no publisher URLs)
ln = LicenseNormaliser(
    registry=[SPDXParser],
    alias=[AliasParser],
    family=[AliasParser],
    name=[AliasParser],
    cache=True,
    cache_maxsize=8192,
)
# MIT resolves via SPDX parser
assert str(ln.normalise_license("MIT")) == "mit"
# CC BY resolves via Alias
assert str(ln.normalise_license("CC BY-NC-ND 4.0")) == "cc-by-nc-nd-4.0"
.. note::
Explicit plugin passing is optional: ``LicenseNormaliser()``
automatically loads defaults. Use the pattern above only if you need
custom plugins or want to reduce the number of plugins loaded.
For caching, ``LicenseNormaliser`` wraps the resolution method
with ``lru_cache``.
Disable it by passing ``cache=False`` for debugging:
.. code-block:: python
:name: test_caching
from licence_normaliser import LicenseNormaliser
ln = LicenseNormaliser(cache=False)
result = ln.normalise_license("MIT")
Update data (CLI)
=================
.. code-block:: sh
licence-normaliser update-data --force
# Fetches fresh SPDX, OpenDefinition, OSI, CreativeCommons, and ScanCode JSONs
Integration tests (public API only)
===================================
All integration tests live in
``src/licence_normaliser/tests/test_integration.py``
and only import the public API.
CLI usage
=========
Normalise a single license:
.. code-block:: sh
licence-normaliser normalise "MIT"
# Output: mit
licence-normaliser normalise --full "CC BY 4.0"
# Output:
# Key: cc-by-4.0
# URL: https://creativecommons.org/licenses/by/4.0/
# License: cc-by
# Family: cc
licence-normaliser normalise --strict "totally-unknown"
# Exits with code 1 and prints an error
Batch normalise:
.. code-block:: sh
licence-normaliser batch MIT "Apache-2.0" "CC BY 4.0"
licence-normaliser batch --strict MIT "Apache-2.0"
Exceptions
==========
.. code-block:: python
:name: test_exceptions
from licence_normaliser.exceptions import (
    LicenseNormaliserError,  # base class
    LicenseNotFoundError,  # raised by strict mode
)
Testing
=======
All tests run inside Docker:
.. code-block:: sh
make test
To test a specific Python version:
.. code-block:: sh
make test-env ENV=py312
License
=======
MIT
Author
======
Artur Barseghyan <artur.barseghyan@gmail.com>
CONTRIBUTING.rst
======================
Contributor guidelines
======================
.. _licence-normaliser: https://github.com/barseghyanartur/licence-normaliser/
.. _uv: https://docs.astral.sh/uv/
.. _tox: https://tox.wiki
.. _ruff: https://docs.astral.sh/ruff/
.. _doc8: https://doc8.readthedocs.io/
.. _pre-commit: https://pre-commit.com/#installation
.. _issues: https://github.com/barseghyanartur/licence-normaliser/issues
.. _discussions: https://github.com/barseghyanartur/licence-normaliser/discussions
.. _pull request: https://github.com/barseghyanartur/licence-normaliser/pulls
.. _versions manifest: https://github.com/actions/python-versions/blob/main/versions-manifest.json
Developer prerequisites
-----------------------
pre-commit
~~~~~~~~~~
Refer to `pre-commit`_ for installation instructions.
TL;DR:
.. code-block:: sh
curl -LsSf https://astral.sh/uv/install.sh | sh # Install uv
uv tool install pre-commit # Install pre-commit
pre-commit install # Install hooks
Installing `pre-commit`_ ensures all contributions adhere to the project's
code quality standards.
Code standards
--------------
`ruff`_ and `doc8`_ are triggered automatically by `pre-commit`_.
To run checks manually:
.. code-block:: sh
make doc8
make ruff
Import conventions
~~~~~~~~~~~~~~~~~~
**Import statements belong at module level.** Avoid placing imports inside
functions or methods unless absolutely necessary:
- **Acceptable exceptions:**

  - Breaking circular dependencies
  - Optional runtime dependencies (e.g., CLI-only imports)
  - Heavy imports that are rarely used

- **Why this matters:**

  - Improves code readability
  - Makes dependencies explicit and discoverable
  - Enables static analysis tools to work correctly
  - Follows Python community best practices (PEP 8)

When in doubt, place imports at the top of the file.
Virtual environment
-------------------
.. code-block:: sh
make create-venv
Installation
------------
.. code-block:: sh
make install
Testing
-------
.. note::
Python 3.15 is being tested on GitHub CI, but not inside a local Docker image.
Docker-based testing (recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All tests run inside Docker for platform independence and consistency:
.. code-block:: sh
make test # full matrix (Python 3.10-3.14)
make test-env ENV=py312 # single Python version
make shell # interactive shell in test container
make shell-env ENV=py312 # interactive shell for specific Python
Local testing (alternative)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
For faster iteration during development, you can run tests locally with ``uv``:
.. code-block:: sh
make install # one-time setup
uv run pytest # run all tests
uv run pytest path/to/test_something.py # run specific test
**Important**: If you encounter tooling errors with local testing, fall back to
Docker-based testing, which is the canonical environment.
GitHub Actions
~~~~~~~~~~~~~~
In any case, GitHub Actions runs the full matrix automatically on every push.
Tests run on Python 3.10–3.15 (all non-EOL versions). See the
`versions manifest`_ for the full list of available Python versions.
Adding new normalisation rules
------------------------------
For a new **alias** or **family override** for an *existing* license:
1. Add an entry to ``src/licence_normaliser/data/aliases/aliases.json``.
2. Optionally, add an ``aliases`` array to define additional lookup variants
(e.g. hyphen vs space forms) that resolve to the same target:
.. code-block:: json
{
    "cc by-nc": {
        "version_key": "cc-by-nc",
        "name_key": "cc-by-nc",
        "family_key": "cc",
        "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"]
    }
}
3. Add a test in ``src/licence_normaliser/tests/test_aliases.py`` or
``test_alias_expansion.py``.
4. No Python changes needed.
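The effect of the optional ``aliases`` array can be pictured with the
following sketch. The ``expand_aliases`` helper is hypothetical and only
illustrates the idea that every listed variant resolves to the same target
as its parent entry; the loader used by the library may work differently.

.. code-block:: python

    import json

    # Same shape as the aliases.json example above.
    RAW = """
    {
      "cc by-nc": {
        "version_key": "cc-by-nc",
        "name_key": "cc-by-nc",
        "family_key": "cc",
        "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"]
      }
    }
    """


    def expand_aliases(entries: dict[str, dict]) -> dict[str, dict]:
        """Map the main key and each listed variant to the same target."""
        table: dict[str, dict] = {}
        for key, entry in entries.items():
            # The target keeps every field except the "aliases" list itself.
            target = {k: v for k, v in entry.items() if k != "aliases"}
            for variant in [key, *entry.get("aliases", [])]:
                table[variant] = target
        return table


    table = expand_aliases(json.loads(RAW))
    print(len(table))
    print(table["cc by nc"]["version_key"])
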
For a new **prose pattern** (regex matching free-text descriptions):
1. Add an entry to ``src/licence_normaliser/data/prose/prose_patterns.json``.
2. Add a test in ``src/licence_normaliser/tests/test_prose.py``.
3. No Python changes needed.
For a new **URL mapping**:
1. Add an entry to ``src/licence_normaliser/data/urls/url_map.json`` or
``src/licence_normaliser/data/publishers/publishers.json``.
2. Add a test in ``src/licence_normaliser/tests/test_publisher.py``.
3. No Python changes needed.
For a **brand-new license key** (SPDX, OpenDefinition, OSI, CC, or ScanCode):
1. The upstream data source must be updated first
(``licence-normaliser update-data --force`` for SPDX/OpenDefinition, or
edit the upstream source for OSI/CC/ScanCode).
2. The parser will pick it up automatically on the next import.
3. Add an alias in ``aliases.json`` if needed.
4. Add a family override in ``aliases.json`` if needed.
5. Add tests.
For a **new parser** (new upstream data source):
1. Create ``src/licence_normaliser/parsers/my_parser.py`` implementing
``BasePlugin``.
2. Register it in ``src/licence_normaliser/parsers/__init__.py``.
3. Set ``is_registry_entry = False`` if the parser only contributes
aliases/URLs/patterns (not new license keys).
4. Add tests.
Releases
--------
**Build the package for releasing:**
.. code-block:: sh
make package-build
----
**Test the built package:**
.. code-block:: sh
make check-package-build
----
**Make a test release (test.pypi.org):**
.. code-block:: sh
make test-release
----
**Release (pypi.org):**
.. code-block:: sh
make release
Adding tests
------------
- Every new normalisation rule must have a corresponding test.
- Tests should cover both successful normalisation and edge cases.
Pull requests
-------------
Open a `pull request`_ to the ``dev`` branch only. Never directly to ``main``.
.. note::
Create pull requests to the ``dev`` branch only!
Examples of welcome contributions:
- Fixing documentation typos or improving explanations.
- Adding test cases for new edge cases.
- Extending support for additional license formats.
- Improving error messages.
General checklist
~~~~~~~~~~~~~~~~~
- Does your change require documentation updates (``README.rst``,
``AGENTS.md``, ``ARCHITECTURE.rst``, ``CONTRIBUTING.rst``)?
- Does your change require new tests?
- Does your change add any external dependencies?
If so, reconsider: ``licence-normaliser`` should have minimal dependencies.
When fixing bugs
~~~~~~~~~~~~~~~~
- Add a regression test that reproduces the bug before your fix.
When adding a new feature
~~~~~~~~~~~~~~~~~~~~~~~~~
- Update ``README.rst``, ``AGENTS.md``, and ``ARCHITECTURE.rst`` if applicable.
- Add appropriate tests.
Questions
---------
Ask on GitHub `discussions`_.
Issues
------
Report bugs or request features on GitHub `issues`_.
AGENTS.md
# AGENTS.md - licence-normaliser
**Repository**: https://github.com/barseghyanartur/licence-normaliser
**Maintainer**: Artur Barseghyan <artur.barseghyan@gmail.com>
---
## 1. Project Mission (Never Deviate)
> Comprehensive license normalisation with a three-level hierarchy - secure,
> fast, and extensible.
- Maps any license representation to a canonical three-level hierarchy
- Supports SPDX tokens, URLs, prose descriptions
- No external dependencies (only optional dev/test deps)
- LRU caching for performance
- Data-file-driven: parsers load from package data JSON files
- `licence-normaliser update-data` CLI command to refresh SPDX + OpenDefinition data
---
## 2. Architecture
### Three-Level Hierarchy
| Level | Class | Example |
| ----- | ----- | ------- |
| **Family** | `LicenseFamily` | `"cc"`, `"osi"`, `"copyleft"`, `"data"` |
| **Name** | `LicenseName` | `"cc-by"`, `"mit"`, `"gpl-3.0-only"` |
| **Version** | `LicenseVersion` | `"cc-by-4.0"`, `"mit"`, `"gpl-3.0-only"` |
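
To make the nesting concrete, here is a minimal sketch of a three-level frozen dataclass hierarchy. The class names `Family`, `Name`, and `Version` and their fields are illustrative assumptions, not the definitions in `_models.py`.

```python
from dataclasses import FrozenInstanceError, dataclass


# Hypothetical stand-ins for LicenseFamily / LicenseName / LicenseVersion.
@dataclass(frozen=True)
class Family:
    key: str

    def __str__(self) -> str:
        return self.key


@dataclass(frozen=True)
class Name:
    key: str
    family: Family

    def __str__(self) -> str:
        return self.key


@dataclass(frozen=True)
class Version:
    key: str
    license: Name

    @property
    def family(self) -> Family:
        # Convenience shortcut so callers can skip one hop.
        return self.license.family

    def __str__(self) -> str:
        return self.key


cc = Family("cc")
cc_by = Name("cc-by", cc)
cc_by_4 = Version("cc-by-4.0", cc_by)
print(cc_by_4, cc_by_4.license, cc_by_4.family)

# Frozen instances are hashable and reject attribute assignment.
try:
    cc_by_4.key = "other"  # type: ignore[misc]
except FrozenInstanceError:
    print("immutable")
```

Frozen dataclasses compare by value and can be used as dictionary keys, which suits objects that are handed out from a cache.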
### Resolution Pipeline
1. **Alias table** - cleaned lowercase key matches `ALIASES` (loaded from `data/aliases/aliases.json`)
2. **Direct registry lookup** - hit in `REGISTRY` (SPDX, OpenDefinition, OSI, CC, ScanCode license keys)
3. **URL map** - hit in `URL_MAP` (loaded from SPDX + OpenDefinition + publisher data)
4. **Prose pattern scan** - regex patterns from `data/prose/prose_patterns.json` (for strings >20 chars)
5. **Fallback** - key = cleaned string, family = unknown
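
The five stages above can be sketched as a chain of lookups that stops at the first hit. The tables and the `resolve` function below are hypothetical and tiny; the real `ALIASES`, `REGISTRY`, and `URL_MAP` are loaded from the data files, and the cleaning step shown here is an assumption.

```python
import re

# Hypothetical lookup tables standing in for the loaded data.
ALIASES = {"cc by 4.0": "cc-by-4.0", "apache 2": "apache-2.0"}
REGISTRY = {"mit", "apache-2.0", "cc-by-4.0"}
URL_MAP = {"https://creativecommons.org/licenses/by/4.0/": "cc-by-4.0"}
PROSE = [(re.compile(r"creative commons attribution 4\.0"), "cc-by-4.0")]


def resolve(raw: str) -> tuple[str, str]:
    """Return (key, stage) for the first stage that matches."""
    # Assumed cleaning: lowercase and collapse runs of whitespace.
    cleaned = " ".join(raw.strip().lower().split())
    if cleaned in ALIASES:
        return ALIASES[cleaned], "alias"
    if cleaned in REGISTRY:
        return cleaned, "registry"
    if cleaned in URL_MAP:
        return URL_MAP[cleaned], "url"
    if len(cleaned) > 20:  # prose scan only for longer strings
        for pattern, key in PROSE:
            if pattern.search(cleaned):
                return key, "prose"
    return cleaned, "fallback"  # key = cleaned string, family unknown


for text in [
    "CC BY 4.0",
    "MIT",
    "https://creativecommons.org/licenses/by/4.0/",
    "Licensed under the Creative Commons Attribution 4.0 licence",
    "something odd",
]:
    print(resolve(text))
```

Ordering the cheap dictionary lookups before the regex scan keeps the common cases fast and makes curated aliases take precedence over everything else.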
### Key Files
| File | Purpose |
| ---- | ------- |
| `src/licence_normaliser/_models.py` | Frozen dataclass hierarchy |
| `src/licence_normaliser/_normaliser.py` | `LicenseNormaliser` class with plugin-based resolution |
| `src/licence_normaliser/plugins.py` | Plugin interfaces (BasePlugin, RegistryPlugin, URLPlugin, etc.) |
| `src/licence_normaliser/defaults.py` | Lazy-loading default plugin bundle |
| `src/licence_normaliser/_cache.py` | Module-level API delegating to `LicenseNormaliser` |
| `src/licence_normaliser/parsers/` | Parser classes implementing plugin interfaces |
| `src/licence_normaliser/cli/_main.py` | CLI with normalise, batch, update-data |
| `src/licence_normaliser/exceptions.py` | `LicenseNormaliserError` (base class) and `LicenseNotFoundError` |
| `src/licence_normaliser/data/spdx/spdx.json` | **DO NOT MODIFY** Full SPDX license list (loaded at runtime) |
| `src/licence_normaliser/data/opendefinition/opendefinition.json` | **DO NOT MODIFY** Full OpenDefinition list (loaded at runtime) |
| `src/licence_normaliser/data/aliases/aliases.json` | Curated aliases with rich metadata |
| `src/licence_normaliser/data/prose/prose_patterns.json` | Curated prose regex patterns |
| `src/licence_normaliser/data/publishers/publishers.json` | Publisher URLs and shorthand aliases |
---
## 3. Using licence-normaliser in Application Code
### Simple case
```python name=test_simple_case
from licence_normaliser import normalise_license
v = normalise_license("MIT")
str(v) # "mit"
```
### With full hierarchy
<!-- continue: test_simple_case -->
```python name=test_full_hierarchy
v = normalise_license("CC BY-NC-ND 4.0")
print(v.key) # "cc-by-nc-nd-4.0"
print(v.license.key) # "cc-by-nc-nd"
print(v.family.key) # "cc"
```
### Strict mode
```python name=test_strict_mode
import pytest
from licence_normaliser import normalise_license, LicenseNotFoundError
# Would normally raise: License not found: 'unknown string'
with pytest.raises(LicenseNotFoundError):
    v = normalise_license("unknown string", strict=True)
# Batch strict
from licence_normaliser import normalise_licenses
with pytest.raises(LicenseNotFoundError):
    results = normalise_licenses(
        ["unknown string", "unknown string 2.0"],
        strict=True,
    )
```
### Custom plugins with LicenseNormaliser
The `LicenseNormaliser` class lets you inject custom plugin classes for
specialised use cases:
```python name=test_custom_plugins
from licence_normaliser import LicenseNormaliser
from licence_normaliser.parsers.spdx import SPDXParser
from licence_normaliser.parsers.alias import AliasParser
# Use only SPDX + Alias plugins (no CC, no publisher URLs)
ln = LicenseNormaliser(
    registry=[SPDXParser],
    alias=[AliasParser],
    family=[AliasParser],
    name=[AliasParser],
)
# MIT resolves via SPDX parser
assert str(ln.normalise_license("MIT")) == "mit"
# CC BY resolves via Alias
assert str(ln.normalise_license("CC BY-NC-ND 4.0")) == "cc-by-nc-nd-4.0"
```
To use all defaults, import from `defaults`:
```python name=test_defaults_usage
from licence_normaliser import LicenseNormaliser
from licence_normaliser.defaults import (
    get_default_registry,
    get_default_url,
    get_default_alias,
    get_default_family,
    get_default_name,
    get_default_prose,
)
ln = LicenseNormaliser(
    registry=get_default_registry(),
    url=get_default_url(),
    alias=get_default_alias(),
    family=get_default_family(),
    name=get_default_name(),
    prose=get_default_prose(),
    cache=True,
    cache_maxsize=8192,
)
result = ln.normalise_license("MIT")
```
> [!NOTE]
> Explicit plugin passing is optional — `LicenseNormaliser()` automatically
> loads defaults. Use the pattern above only if you need custom plugins.
For caching, `LicenseNormaliser` wraps the resolution method with `lru_cache`.
Disable it by passing `cache=False` for debugging:
```python name=test_caching
from licence_normaliser import LicenseNormaliser
ln = LicenseNormaliser(cache=False)
result = ln.normalise_license("MIT")
```
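
As a rough illustration of what wrapping the resolution method with `lru_cache` can look like, the sketch below builds the cache per instance. `CachedResolver` and its `calls` counter are hypothetical; only the `cache` and `cache_maxsize` parameter names mirror the examples above.

```python
from functools import lru_cache


class CachedResolver:
    """Hypothetical resolver showing per-instance lru_cache wrapping."""

    def __init__(self, cache: bool = True, cache_maxsize: int = 8192) -> None:
        self.calls = 0  # counts real (uncached) resolutions
        self.resolve = self._resolve
        if cache:
            # Wrap the bound method so each instance owns its cache.
            self.resolve = lru_cache(maxsize=cache_maxsize)(self._resolve)

    def _resolve(self, raw: str) -> str:
        self.calls += 1
        return raw.strip().lower()


cached = CachedResolver()
cached.resolve("MIT")
cached.resolve("MIT")
print(cached.calls)  # 1

uncached = CachedResolver(cache=False)
uncached.resolve("MIT")
uncached.resolve("MIT")
print(uncached.calls)  # 2
```

With the cache disabled, every call runs the full resolution, which is why `cache=False` is convenient when stepping through the pipeline in a debugger.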
---
## 4. Updating Data Sources
SPDX and OpenDefinition data can be updated via the CLI:
```sh
licence-normaliser update-data --force
```
This fetches fresh JSON from the authoritative upstream URLs and writes them to:
- `src/licence_normaliser/data/spdx/spdx.json`
- `src/licence_normaliser/data/opendefinition/opendefinition.json`
---
## 4a. Trace / Explain
When debugging why a license resolves a certain way, or aligning curated
data sources, use the trace feature:
**Via CLI:**
```sh
licence-normaliser normalise "MIT" --trace
licence-normaliser normalise "CC BY-NC-ND 3.0 igo" --trace
licence-normaliser batch MIT Apache --trace
```
Or via environment variable:
```sh
ENABLE_LICENCE_NORMALISER_TRACE=1 licence-normaliser normalise "MIT"
```
**Via Python:**
```python name=test_trace
from licence_normaliser import normalise_license
v = normalise_license("MIT", trace=True)
print(v.explain())
```
The trace shows:
- Each resolution stage attempted (alias → registry → url → prose → fallback)
- Whether it matched (✓) or didn't (-)
- Source file and line number for curated sources (aliases.json, publishers.json, prose_patterns.json)
- Final result with version_key, name_key, family_key
This is essential for:
- Understanding why a license resolves unexpectedly
- Finding the source line that defines an alias when curating data
- Debugging resolution order issues
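
One way to picture a trace is as a list of stage records that is rendered on demand. The `Trace` class below is a hypothetical sketch, not the library's `_trace` module, and its output format only loosely follows the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical recorder for the stages tried during resolution."""

    raw: str
    steps: list[tuple[str, bool, str]] = field(default_factory=list)

    def record(self, stage: str, matched: bool, detail: str = "") -> None:
        self.steps.append((stage, matched, detail))

    def explain(self) -> str:
        lines = [f"Input: {self.raw!r}"]
        for stage, matched, detail in self.steps:
            mark = "✓" if matched else "-"
            lines.append(f"  [{mark}] {stage}: {detail}".rstrip())
        return "\n".join(lines)


trace = Trace("MIT")
trace.record("alias", False)
trace.record("registry", True, "'mit' found in registry")
print(trace.explain())
```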
---
## 5. Adding a New Parser
Parsers implement plugin interfaces and can be added to `src/licence_normaliser/parsers/`:
1. Create `src/licence_normaliser/parsers/my_parser.py` implementing one or more plugin interfaces:
```python name=test_adding_new_parser
from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin
class MyParser(BasePlugin, RegistryPlugin, URLPlugin):
    url = None  # or upstream URL for refresh
    local_path = "data/my_parser/my_data.json"

    def load_registry(self) -> dict[str, str]:
        # Return {"license_key": "license_key", ...}
        return {}

    def load_urls(self) -> dict[str, str]:
        # Return {"https://...": "license_key", ...}
        return {}
```
2. Register it in `src/licence_normaliser/defaults.py`:
<!-- continue: test_adding_new_parser -->
```python name=test_adding_new_parser_register
from licence_normaliser.parsers.spdx import SPDXParser
def _load_registry_plugins() -> list[type]:
    # ... other imports
    return [
        SPDXParser,
        # ... other plugins
        MyParser,
    ]
```
**Key attribute**: Set `url = None` on parsers that only contribute local data (no refresh capability).
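
To show how a list of plugin classes might be turned into one lookup table, here is a self-contained sketch. `FirstParser`, `SecondParser`, and `build_registry` are hypothetical, and the "first plugin wins" rule is an assumption of this sketch rather than documented behaviour; only the `load_registry` method name and its `dict[str, str]` return type come from the example above.

```python
class FirstParser:
    """Hypothetical parser contributing two keys."""

    def load_registry(self) -> dict[str, str]:
        return {"mit": "mit", "apache-2.0": "apache-2.0"}


class SecondParser:
    """Hypothetical parser that overlaps on one key."""

    def load_registry(self) -> dict[str, str]:
        return {"mit": "mit", "cc-by-4.0": "cc-by-4.0"}


def build_registry(plugin_classes: list[type]) -> dict[str, str]:
    """Instantiate each plugin class and merge its registry entries."""
    merged: dict[str, str] = {}
    for plugin_cls in plugin_classes:
        for key, value in plugin_cls().load_registry().items():
            # First plugin wins on conflicts (an assumption of this sketch).
            merged.setdefault(key, value)
    return merged


registry = build_registry([FirstParser, SecondParser])
print(sorted(registry))
```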
---
## 6. Coding Conventions
- Line length: **88 characters** (ruff)
- Every non-test module must have `__all__`, `__author__`, `__copyright__`, `__license__`
- Always chain exceptions: `raise X(...) from exc`
- Type annotations on all public functions
- Target: `py310`
- Import statements: Avoid imports inside functions/methods unless absolutely
necessary (e.g., breaking circular dependencies or optional runtime
dependencies). Lazy imports harm readability and make dependencies unclear.
Run linting: `make ruff` or `make pre-commit`
---
## 7. Agent Workflow: Adding Features or Fixing Bugs
1. **Check the mission** - does the change preserve the no-dependencies policy and three-level hierarchy?
2. **Identify the correct location**:
   - New SPDX/OD license → update SPDX/OpenDefinition JSON files (run `update-data`)
   - New alias or family override → add to `data/aliases/aliases.json`
     - **Use `--trace` to find the exact line that defines an alias**
   - New URL mapping → add to `data/publishers/publishers.json`
   - New prose pattern → add to `data/prose/prose_patterns.json`
   - New parser → `parsers/my_parser.py` + `defaults.py`
   - Core pipeline change → `_normaliser.py` or `_cache.py`
3. **Write tests** covering both success and error cases
4. **Update README.rst** if the API changed
5. **Suggest running**: `make test-env ENV=py312` then `make test`
6. **Suggest running**: `make pre-commit`
---
## 8. Testing Rules
> [!NOTE]
> Python 3.15 is being tested on GitHub CI, but not inside a local Docker image.
### Docker-based testing (recommended)
All tests run inside Docker for platform independence and consistency:
```sh
make test # full matrix (Python 3.10-3.14)
make test-env ENV=py312 # single version
make shell # interactive shell in test container
```
### Local testing (alternative)
For faster iteration during development, you can run tests locally with `uv`:
```sh
make install # one-time setup
uv run pytest # run all tests
uv run pytest path/to/test_something.py # run specific test
```
**Important**: If you encounter tooling errors with local testing, fall back to Docker-based testing, which is the canonical environment.
### Test layout
```text
src/licence_normaliser/tests/
test_integration.py - public API only (survives any rewrite)
test_core.py - end-to-end pipeline tests
test_exceptions.py - exception hierarchy and strict mode
test_cli.py - CLI commands including update-data
test_models.py - LicenseFamily, LicenseName, LicenseVersion
test_aliases.py - non-CC aliases (Apache, MIT, BSD, GPL, etc.)
test_alias_expansion.py - explicit aliases array expansion feature
test_publisher.py - publisher URLs and shorthand aliases
test_prose.py - prose pattern matching
```
### Documentation snippet conventions
Code blocks in this file use two special attributes to support chained
executable tests:
- `name=<test_name>` — labels a snippet so it can be referenced later.
- `<!-- continue: <test_name> -->` placed immediately before a code block
means that block **continues** the named snippet; all names, imports,
and variables defined in the named block are already in scope and must
**not** be re-imported or re-declared in the continuation block.
Example:
```python name=test_my_example
class Foo:
    pass
```
<!-- continue: test_my_example -->
```python name=test_my_example_continued
foo = Foo() # Foo is in scope from the named block above
assert isinstance(foo, Foo)
```
---
## 9. Forbidden
- Adding external dependencies
- Removing existing normalisation coverage
- Changing the three-level hierarchy structure
- Modifying the following files is strictly forbidden:
  - `src/licence_normaliser/data/creativecommons/creativecommons.json`
  - `src/licence_normaliser/data/opendefinition/opendefinition.json`
  - `src/licence_normaliser/data/osi/osi.json`
  - `src/licence_normaliser/data/scancode_licensedb/scancode_licensedb.json`
  - `src/licence_normaliser/data/spdx/spdx.json`
Use `licence-normaliser update-data --force` to refresh them from upstream
sources.
conftest.py
"""Pytest fixtures for documentation testing."""
from typing import Any as AnyType
import pytest
@pytest.fixture()
def Any() -> AnyType:  # noqa
    """Fixture for use in documentation snippets."""
    return AnyType
docker-compose.yml
services:
  tox:
    build: .
    volumes:
      - ./htmlcov:/app/htmlcov
pyproject.toml
[project]
name = "licence-normaliser"
description = "Comprehensive license normalisation with a three-level hierarchy."
readme = "README.rst"
version = "0.3.2"
requires-python = ">=3.10"
dependencies = []
authors = [
{ name = "Artur Barseghyan", email = "artur.barseghyan@gmail.com" },
]
maintainers = [
{ name = "Artur Barseghyan", email = "artur.barseghyan@gmail.com" },
]
license = "MIT"
classifiers = [
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: 3.14",
"Programming Language :: Python :: 3.15",
"Programming Language :: Python",
"Topic :: Software Development :: Libraries :: Python Modules",
]
keywords = [
"license",
"normalisation",
"spdx",
"creative commons",
"open source",
]
[project.scripts]
licence-normaliser = "licence_normaliser.cli:main"
[project.urls]
Homepage = "https://github.com/barseghyanartur/licence-normaliser/"
Repository = "https://github.com/barseghyanartur/licence-normaliser/"
Issues = "https://github.com/barseghyanartur/licence-normaliser/issues"
[project.optional-dependencies]
all = ["licence-normaliser[dev,test,docs,build]"]
dev = [
"detect-secrets",
"doc8",
"ipython",
"mypy",
"ruff",
"uv",
]
test = [
"pytest",
"pytest-cov",
"pytest-codeblock",
]
docs = [
"sphinx",
"sphinx-autobuild",
"sphinx-rtd-theme>=1.3.0",
"sphinx-no-pragma",
"sphinx-markdown-builder",
"sphinx-llms-txt-link",
"sphinx-source-tree",
]
build = [
"build",
"twine",
"wheel",
]
[tool.setuptools]
package-dir = {"" = "src"}
[tool.setuptools.packages.find]
where = ["src"]
include = ["licence_normaliser", "licence_normaliser.*"]
[tool.setuptools.package-data]
"licence_normaliser" = ["data/**/*.json"]
[build-system]
requires = ["setuptools>=41.0", "wheel"]
build-backend = "setuptools.build_meta"
[tool.ruff]
line-length = 88
lint.select = [
"B",
"C4",
"E",
"F",
"G",
"I",
"ISC",
"INP",
"N",
"PERF",
"Q",
"SIM",
]
lint.ignore = [
"G004",
"ISC003",
]
fix = true
src = ["src/licence_normaliser"]
exclude = [
".bzr",
".direnv",
".eggs",
".git",
".hg",
".mypy_cache",
".nox",
".pants.d",
".ruff_cache",
".svn",
".tox",
".venv",
"__pypackages__",
"_build",
"buck-out",
"build",
"dist",
"node_modules",
"venv",
"docs",
]
target-version = "py310"
lint.dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"
[tool.ruff.lint.isort]
known-first-party = ["licence_normaliser"]
[tool.ruff.lint.per-file-ignores]
"conftest.py" = [
"PERF203"
]
[tool.doc8]
ignore-path = [
"docs/requirements.txt",
"src/licence_normaliser.egg-info/SOURCES.txt",
]
[tool.pytest.ini_options]
addopts = [
"-ra",
"-vvv",
"-q",
"--cov=licence_normaliser",
"--ignore=.tox",
"--cov-report=html",
"--cov-report=term",
"--cov-append",
"--capture=no",
]
testpaths = [
"src/licence_normaliser/tests",
".",
"**/*.rst",
"**/*.md",
]
pythonpath = ["src"]
norecursedirs = [".git", ".tox"]
[tool.coverage.run]
relative_files = true
omit = [".tox/*"]
source = ["licence_normaliser"]
[tool.coverage.report]
show_missing = true
exclude_lines = [
"pragma: no cover",
"@overload",
]
[tool.mypy]
check_untyped_defs = true
warn_unused_ignores = true
warn_redundant_casts = true
warn_unused_configs = true
ignore_missing_imports = true
[tool.sphinx-source-tree]
ignore = [
"*.egg-info",
"*.py,cover",
"*.pyc",
"*.pyo",
".DS_Store",
".coverage",
".coverage.*",
".git",
".hg",
".hypothesis",
".idea",
".mypy_cache",
".nox",
".pre-commit-config.yaml",
".pre-commit-hooks.yaml",
".pytest_cache",
".readthedocs.yaml",
".ruff_cache",
".secrets.baseline",
".svn",
".tox",
".venv",
".vscode",
"CHANGELOG.rst",
"CODE_OF_CONDUCT.rst",
"LICENSE",
"SECURITY.rst",
"Thumbs.db",
"__pycache__",
"build",
"codebin",
"dist",
"docs/Makefile",
"docs/_build",
"docs/_static",
"docs/changelog.rst",
"docs/code_of_conduct.rst",
"docs/make.bat",
"docs/requirements.txt",
"docs/security.rst",
"docs/source_tree.rst",
"docs/source_tree_full.rst",
"env",
"htmlcov",
"node_modules",
"venv",
"ARCHITECTURE.rst",
".coderabbit.yaml",
".coveralls",
"docs/full-llms.rst",
"docs/llms.rst",
"docs/contributor_guidelines.rst",
"docs/package.rst",
"docs/documentation.rst",
"docs/index.rst",
"uv.lock",
"codebin",
"src/licence_normaliser/data/creativecommons",
"src/licence_normaliser/data/opendefinition",
"src/licence_normaliser/data/osi",
"src/licence_normaliser/data/scancode_licensedb",
"src/licence_normaliser/data/spdx",
]
order = [
"README.rst",
"CONTRIBUTING.rst",
"AGENTS.md",
]
[[tool.sphinx-source-tree.files]]
output = "docs/full_llms.rst"
title = "Full project source-tree"
[[tool.sphinx-source-tree.files]]
output = "docs/llms.rst"
title = "Project source-tree"
ignore = [
"*.egg-info",
"*.py,cover",
"*.pyc",
"*.pyo",
".DS_Store",
".coverage",
".coverage.*",
".git",
".hg",
".hypothesis",
".idea",
".mypy_cache",
".nox",
".pre-commit-config.yaml",
".pre-commit-hooks.yaml",
".pytest_cache",
".readthedocs.yaml",
".ruff_cache",
".secrets.baseline",
".svn",
".tox",
".venv",
".vscode",
"CHANGELOG.rst",
"CODE_OF_CONDUCT.rst",
"LICENSE",
"SECURITY.rst",
"Thumbs.db",
"__pycache__",
"build",
"codebin",
"dist",
"docs/Makefile",
"docs/_build",
"docs/_static",
"docs/changelog.rst",
"docs/code_of_conduct.rst",
"docs/make.bat",
"docs/requirements.txt",
"docs/security.rst",
"docs/source_tree.rst",
"docs/source_tree_full.rst",
"env",
"htmlcov",
"node_modules",
"venv",
"examples",
"docs",
"ARCHITECTURE.rst",
".coderabbit.yaml",
".coveralls",
"docs/full_llms.rst",
"docs/llms.rst",
"docs/contributor_guidelines.rst",
"docs/package.rst",
"docs/documentation.rst",
"docs/index.rst",
"uv.lock",
"src/licence_normaliser/data/creativecommons",
"src/licence_normaliser/data/opendefinition",
"src/licence_normaliser/data/osi",
"src/licence_normaliser/data/scancode_licensedb",
"src/licence_normaliser/data/spdx",
]
scripts/README.rst
Scripts
=======
Sort aliases
------------
Sorts ``aliases.json`` keys alphabetically. Comment keys (starting with
``_``) are preserved at the top in their original order. All other entries
are sorted case-insensitively.
.. code-block:: sh

    uv run python scripts/sort_aliases.py
    uv run python scripts/sort_aliases.py --check  # exit 1 if not sorted
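The ordering rule described above can be illustrated with a minimal sketch. The source of ``sort_aliases.py`` is not shown here, so the function name ``sort_alias_entries`` and the sample data are assumptions made purely for illustration, not the script's actual implementation.

```python
def sort_alias_entries(data: dict) -> dict:
    """Return a copy with comment keys first, then the rest sorted.

    Hypothetical helper: it only illustrates the ordering rule.
    """
    # Comment keys (leading underscore) keep their original relative order.
    comments = {k: v for k, v in data.items() if k.startswith("_")}
    # All other keys are ordered case-insensitively; the raw key is used as
    # a tie-breaker so the result is deterministic.
    rest = sorted(
        (k for k in data if not k.startswith("_")),
        key=lambda k: (k.casefold(), k),
    )
    return {**comments, **{k: data[k] for k in rest}}


# Illustrative data only, not taken from the real aliases.json.
example = {
    "mit": {"version_key": "mit"},
    "_comment": "curated aliases",
    "Apache-2.0": {"version_key": "apache-2.0"},
    "bsd": {"version_key": "bsd-3-clause"},
}
print(list(sort_alias_entries(example)))
```

A ``--check`` style mode could then compare ``list(data)`` with ``list(sort_alias_entries(data))`` and exit non-zero when they differ.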
Find alias duplicates
---------------------
Finds duplicate ``version_key`` entries in ``aliases.json``. Two or more
top-level primary keys that share the same ``version_key`` count as
duplicates; only groups with more than one member are reported.
Can optionally fix duplicates by merging them into the ``aliases`` list of
a single canonical entry.
.. code-block:: sh

    uv run python scripts/find_alias_duplicates.py
    uv run python scripts/find_alias_duplicates.py --fix  # interactive fix
    uv run python scripts/find_alias_duplicates.py --noinput  # auto-apply safe fixes
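The duplicate definition above amounts to grouping primary keys by their ``version_key``. The following is a minimal sketch under that assumption; ``find_duplicate_version_keys`` is a hypothetical name and the sample entries are made up, since the script's own source is not shown here.

```python
from collections import defaultdict


def find_duplicate_version_keys(data: dict) -> dict[str, list[str]]:
    """Group primary keys by version_key and keep groups with 2+ members."""
    groups: dict[str, list[str]] = defaultdict(list)
    for primary_key, meta in data.items():
        # Skip comment keys and anything that is not a metadata mapping.
        if primary_key.startswith("_") or not isinstance(meta, dict):
            continue
        version_key = meta.get("version_key", "")
        if version_key:
            groups[version_key].append(primary_key)
    return {vk: keys for vk, keys in groups.items() if len(keys) > 1}


# Illustrative data only.
example = {
    "_comment": "illustrative data only",
    "gpl-3.0": {"version_key": "gpl-3.0"},
    "gplv3": {"version_key": "gpl-3.0"},
    "mit": {"version_key": "mit"},
}
print(find_duplicate_version_keys(example))
```

Merging a group would then mean picking one canonical primary key and moving the others into its ``aliases`` list, as the description above outlines.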
Apply aliases patch
-------------------
Applies curated additions to ``aliases.json``. Adds an ``aliases`` list to
existing CC version-free entries and adds new top-level entries for GPL
shorthand keys that currently fall through to the unknown fallback.
.. code-block:: sh

    uv run python scripts/apply_aliases_patch.py
Compare datasets
----------------
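Compares SPDX, OpenDefinition, OSI, CreativeCommons, ScanCode, and curated
data files (aliases, url_map, prose, publishers). Reports dataset sizes,
cross-dataset overlaps, OSI licenses missing from SPDX, orphan curated
targets, registry entries without curated aliases, and the most-aliased
license targets.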
.. code-block:: sh

    uv run python scripts/compare_datasets.py
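Most of the comparison reduces to set arithmetic over license identifiers. The sketch below uses made-up identifier sets and a hypothetical ``overlap`` helper to show how an overlap count and percentage can be derived; it is not the script's actual code.

```python
def overlap(a: set[str], b: set[str]) -> tuple[int, str]:
    """Return the shared count and its share of ``b`` as a percentage string."""
    shared = len(a & b)
    # max(..., 1) avoids a division by zero when ``b`` is empty.
    return shared, f"{100 * shared / max(len(b), 1):.1f}%"


# Illustrative identifier sets, not the real datasets.
spdx = {"MIT", "Apache-2.0", "GPL-3.0-only", "Zlib"}
osi = {"MIT", "Apache-2.0", "Artistic-2.0"}

count, share = overlap(spdx, osi)
print(f"SPDX n OSI: {count} ({share} of OSI)")
print("OSI IDs not in SPDX:", sorted(osi - spdx))
```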
Check missing aliases
---------------------
Checks which licenses downloaded from the internet (via refreshable plugins)
have corresponding entries in the curated ``aliases.json`` file.
.. code-block:: sh

    uv run python scripts/check_missing_aliases.py
    uv run python scripts/check_missing_aliases.py --json  # JSON output
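The coverage figure is a set intersection expressed as a percentage of the downloaded keys. A minimal sketch with made-up data follows; the ``coverage`` helper is a simplified stand-in for the script's ``check_coverage`` function, not a copy of it.

```python
def coverage(downloaded: set[str], alias_targets: set[str]) -> dict:
    """Summarise which downloaded keys have a curated alias target."""
    with_alias = downloaded & alias_targets
    without_alias = downloaded - alias_targets
    # Guard against an empty download set to avoid dividing by zero.
    percent = round(len(with_alias) / len(downloaded) * 100, 1) if downloaded else 0
    return {
        "with_alias": sorted(with_alias),
        "without_alias": sorted(without_alias),
        "coverage_percent": percent,
    }


# Illustrative data only.
report = coverage(
    downloaded={"mit", "gpl-3.0", "zlib", "isc"},
    alias_targets={"mit", "gpl-3.0", "cc-by-4.0"},
)
print(report)
```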
Test name inference
-------------------
Assesses the accuracy of heuristic name stripping against the curated
``name_key`` values in ``aliases.json``. Shows how well automatic name
extraction works for different license families (CC, copyleft, OSI, etc.).
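.. code-block:: sh

    uv run python scripts/test_name_inference.py
    uv run python scripts/test_name_inference.py --json  # JSON summary
    uv run python scripts/test_name_inference.py --json --details  # include per-license details
    uv run python scripts/test_name_inference.py --json --incorrect-only  # only mismatches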
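The accuracy measurement itself is simple: infer a name for each ``version_key`` and count how often it equals the curated ``name_key``. The sketch below uses a deliberately naive, hypothetical heuristic (``strip_version``) and made-up curated pairs. The real script delegates to ``LicenseNormaliser._infer_name``, whose logic is not reproduced here.

```python
import re
from typing import Callable

# Hypothetical heuristic: drop a trailing "-<number>[.<number>...]" suffix.
_VERSION_SUFFIX = re.compile(r"-\d+(\.\d+)*$")


def strip_version(version_key: str) -> str:
    return _VERSION_SUFFIX.sub("", version_key)


def accuracy(mappings: dict[str, str], infer: Callable[[str], str]) -> float:
    """Percentage of version keys whose inferred name equals the curated one."""
    if not mappings:
        return 0.0
    correct = sum(1 for vk, nk in mappings.items() if infer(vk) == nk)
    return round(correct / len(mappings) * 100, 1)


# Illustrative curated pairs, not taken from the real aliases.json.
curated = {
    "cc-by-4.0": "cc-by",
    "apache-2.0": "apache",
    "gpl-3.0-only": "gpl",  # the naive heuristic misses this one
    "mit": "mit",
}
print(accuracy(curated, strip_version))
```

Cases such as ``gpl-3.0-only`` show why a per-family breakdown is useful: a heuristic can do well on one family and poorly on another.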
scripts/__init__.py
scripts/check_missing_aliases.py
"""Check which downloaded licenses are missing from curated aliases.
Compares all refreshable plugin registries against aliases.json to identify
licenses that have no corresponding curated alias entry.
Usage:
uv run python scripts/check_missing_aliases.py
uv run python scripts/check_missing_aliases.py --json
"""
from __future__ import annotations
import contextlib
import json
import sys
from pathlib import Path
DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data"
SCRIPTS_DIR = Path(__file__).parent
def load_alias_targets() -> set[str]:
"""Load all version_keys from aliases.json."""
with open(DATA_DIR / "aliases" / "aliases.json") as f:
data = json.load(f)
targets: set[str] = set()
for meta in data.values():
if isinstance(meta, dict):
vk = meta.get("version_key", "")
if vk:
targets.add(vk)
return targets
def load_downloaded_licenses() -> dict[str, set[str]]:
"""Load licenses from all refreshable plugins."""
from licence_normaliser.defaults import get_all_refreshable_plugins
result: dict[str, set[str]] = {}
for plugin_cls in get_all_refreshable_plugins():
# Try to load registry
data = None
with contextlib.suppress(Exception):
data = plugin_cls().load_registry()
if data:
result[plugin_cls.__name__] = set(data.keys())
return result
def check_coverage() -> dict:
"""Check which downloaded licenses have alias entries."""
alias_targets = load_alias_targets()
downloaded = load_downloaded_licenses()
all_downloaded: set[str] = set()
for licenses in downloaded.values():
all_downloaded.update(licenses)
# Categorize
with_alias = all_downloaded & alias_targets
without_alias = all_downloaded - alias_targets
return {
"total_downloaded": len(all_downloaded),
"total_alias_targets": len(alias_targets),
"with_alias": sorted(with_alias),
"without_alias": sorted(without_alias),
"coverage_percent": round(len(with_alias) / len(all_downloaded) * 100, 1)
if all_downloaded
else 0,
"by_source": {
name: {
"total": len(licenses),
"with_alias": len(licenses & alias_targets),
"without_alias": sorted(licenses - alias_targets),
"coverage": round(
len(licenses & alias_targets) / len(licenses) * 100, 1
)
if licenses
else 0,
}
for name, licenses in downloaded.items()
},
}
def group_by_prefix(licenses: list[str]) -> dict[str, list[str]]:
    """Group licenses by common prefixes."""
    groups: dict[str, list[str]] = {}
    prefixes = [
        "gpl-",
        "agpl-",
        "lgpl-",
        "apache-",
        "mpl-",
        "mit",
        "bsd",
        "cc-",
        "unlicense",
        "zlib",
        "isc",
    ]
    for prefix in prefixes:
        matches = sorted([lic for lic in licenses if lic.startswith(prefix)])
        if matches:
            # No prefix is empty, so the stripped prefix is always a usable
            # group name and needs no fallback.
            groups[prefix.rstrip("-")] = matches
            # Remove matched licenses so they are not grouped twice.
            licenses = [lic for lic in licenses if not lic.startswith(prefix)]
    if licenses:
        groups["other"] = sorted(licenses)
    return groups
def print_report(data: dict) -> None:
"""Print text table report."""
print("=" * 70)
print("Coverage Report: Downloaded Licenses vs Curated Aliases")
print("=" * 70)
print()
print(f"Total downloaded: {data['total_downloaded']}")
print(f"Total alias targets: {data['total_alias_targets']}")
print(f"Coverage: {data['coverage_percent']}%")
print()
print("-" * 70)
print("By Source:")
print("-" * 70)
print(f"{'Source':<30} {'Total':>8} {'With':>8} {'Without':>8} {'Coverage':>10}")
print("-" * 70)
for source, stats in data["by_source"].items():
print(
f"{source:<30} {stats['total']:>8} "
f"{stats['with_alias']:>8} {len(stats['without_alias']):>8} "
f"{stats['coverage']:>9.1f}%"
)
print()
print("=" * 70)
print(f"Missing Aliases ({len(data['without_alias'])} licenses)")
print("=" * 70)
groups = group_by_prefix(data["without_alias"].copy())
for group_name, licenses in groups.items():
if group_name == "other":
print()
print(f"All other licenses ({len(licenses)}):")
else:
print()
print(f"{group_name.upper()} ({len(licenses)}):")
for lic in licenses:
print(f" {lic}")
print()
def main() -> None:
json_export = "--json" in sys.argv
data = check_coverage()
if json_export:
print(json.dumps(data, indent=2))
else:
print_report(data)
if __name__ == "__main__":
main()
scripts/compare_datasets.py
"""Dataset comparison tool for licence-normaliser.
Compares SPDX, OpenDefinition, OSI, CreativeCommons, ScanCode, and
curated data files (aliases, url_map, prose, publishers) for:
- Dataset sizes
- Cross-dataset overlaps
- Licenses present in OSI but missing from SPDX
- Orphan alias/URL targets (don't resolve to REGISTRY entries)
- REGISTRY entries without curated aliases
- Most-aliased license targets
"""
from __future__ import annotations
__all__ = ()
import json
from collections import Counter
from pathlib import Path
DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data"
def load_spdx_ids() -> set[str]:
with open(DATA_DIR / "spdx" / "spdx.json") as f:
data = json.load(f)
return {entry["licenseId"] for entry in data["licenses"]}
def load_od_ids() -> set[str]:
with open(DATA_DIR / "opendefinition" / "opendefinition.json") as f:
data = json.load(f)
return set(data.keys())
def load_osi_ids() -> set[str]:
with open(DATA_DIR / "osi" / "osi.json") as f:
data = json.load(f)
return {entry["spdx_id"].strip() for entry in data if entry.get("spdx_id")}
def load_cc_ids() -> set[str]:
with open(DATA_DIR / "creativecommons" / "creativecommons.json") as f:
data = json.load(f)
return {entry["license_key"] for entry in data}
def load_sc_ids() -> set[str]:
with open(DATA_DIR / "scancode_licensedb" / "scancode_licensedb.json") as f:
data = json.load(f)
return {entry["license_key"] for entry in data}
def load_alias_keys() -> set[str]:
with open(DATA_DIR / "aliases" / "aliases.json") as f:
data = json.load(f)
return {k for k in data if not k.startswith("_")}
def load_alias_targets() -> dict[str, str]:
with open(DATA_DIR / "aliases" / "aliases.json") as f:
data = json.load(f)
return {
k: v.get("version_key", "") for k, v in data.items() if not k.startswith("_")
}
def load_url_keys() -> set[str]:
with open(DATA_DIR / "urls" / "url_map.json") as f:
data = json.load(f)
return {k for k in data if not k.startswith("_")}
def load_url_targets() -> dict[str, str]:
with open(DATA_DIR / "urls" / "url_map.json") as f:
data = json.load(f)
return {
k: v.get("version_key", "") for k, v in data.items() if not k.startswith("_")
}
def load_prose_targets() -> list[str]:
with open(DATA_DIR / "prose" / "prose_patterns.json") as f:
data = json.load(f)
return [entry.get("version_key", "") for entry in data]
def load_pub_urls() -> set[str]:
with open(DATA_DIR / "publishers" / "publishers.json") as f:
data = json.load(f)
return set(data.get("urls", {}).keys())
def load_pub_aliases() -> dict[str, str]:
with open(DATA_DIR / "publishers" / "publishers.json") as f:
data = json.load(f)
return dict(data.get("shorthand_aliases", {}))
def load_registry_keys() -> set[str]:
from licence_normaliser._cache import get_registry_keys
return get_registry_keys()
def load_merged_aliases() -> dict[str, str]:
"""Simulate merged ALIASES: alias_key -> version_key from all curated sources."""
merged: dict[str, str] = {}
merged.update(load_alias_targets())
merged.update(load_pub_aliases())
for k, v in load_url_targets().items():
if k not in merged:
merged[k] = v
return merged
def would_resolve(alias_key: str, registry: set[str], aliases: dict[str, str]) -> bool:
"""Simulate _resolve() pipeline for orphan detection.
1. If already in REGISTRY, covered.
2. If in ALIASES, get version_key - resolves regardless of registry presence.
"""
if alias_key in registry:
return True
version_key = aliases.get(alias_key, "")
return bool(version_key)
def section(title: str) -> None:
print(f"\n{'=' * 60}")
print(f" {title}")
print(f"{'=' * 60}")
def main() -> None:
print("Loading datasets...")
spdx = load_spdx_ids()
od = load_od_ids()
osi = load_osi_ids()
cc = load_cc_ids()
sc = load_sc_ids()
alias_keys = load_alias_keys()
alias_tgt = load_alias_targets()
url_keys = load_url_keys()
url_tgt = load_url_targets()
prose_tgt = load_prose_targets()
pub_urls = load_pub_urls()
pub_aliases = load_pub_aliases()
registry = load_registry_keys()
merged_aliases = load_merged_aliases()
# --- 1. Dataset sizes ---
section("Dataset Sizes")
print(f" SPDX licenses: {len(spdx):>6}")
print(f" OpenDefinition entries: {len(od):>6}")
print(f" OSI-approved (SPDX): {len(osi):>6}")
print(f" CreativeCommons: {len(cc):>6}")
print(f" ScanCode DB entries: {len(sc):>6}")
print(f" Aliases (curated): {len(alias_keys):>6}")
print(f" URL mappings (curated): {len(url_keys):>6}")
print(f" Prose patterns: {len(prose_tgt):>6}")
print(f" Publisher URLs: {len(pub_urls):>6}")
print(f" Publisher aliases: {len(pub_aliases):>6}")
print(f" REGISTRY entries: {len(registry):>6}")
# --- 2. Overlaps ---
section("Cross-Dataset Overlaps")
# SPDX overlaps
def pct(sub: int, total: int) -> str:
return f"{100 * sub / max(total, 1):.1f}%"
overlaps = [
("SPDX n OSI", len(spdx & osi), len(osi), "OSI"),
("SPDX n OD", len(spdx & od), len(od), "OD"),
("SPDX n CC", len(spdx & cc), len(cc), "CC"),
("OSI n OD", len(osi & od), len(od), "OD"),
("OSI n CC", len(osi & cc), len(cc), "CC"),
("OD n CC", len(od & cc), len(cc), "CC"),
("ScanCode n SPDX", len(sc & spdx), len(sc), "ScanCode"),
("ScanCode n OSI", len(sc & osi), len(sc), "ScanCode"),
]
for label, overlap_count, total_count, pct_label in overlaps:
ratio = pct(overlap_count, total_count)
print(f" {label:<17} {overlap_count:>5} ({ratio} of {pct_label})")
# Unique content
print(f"\n Unique to SPDX: {len(spdx - od - osi - cc - sc):>6}")
print(f" Unique to OD: {len(od - spdx):>6}")
print(f" Unique to OSI: {len(osi - spdx):>6} (OSI IDs not in SPDX)")
print(f" Unique to CC: {len(cc - spdx - od):>6}")
print(f" Unique to ScanCode: {len(sc - spdx - osi - od - cc):>6}")
# --- 3. OSI licenses not in SPDX (reference integrity) ---
section("OSI Licenses Missing from SPDX")
osi_only = sorted(osi - spdx)
if osi_only:
print(f" {len(osi_only)} OSI-licensed IDs have no SPDX entry:")
for lid in osi_only[:20]:
print(f" {lid}")
if len(osi_only) > 20:
print(f" ... and {len(osi_only) - 20} more")
else:
print(" All OSI IDs are present in SPDX.")
# --- 4. Curated targets not in REGISTRY ---
section("Curated Targets Missing from REGISTRY")
orphan_alias = sorted(
k for k in alias_keys if not would_resolve(k, registry, merged_aliases)
)
orphan_url = sorted(
k for k in url_keys if not would_resolve(k, registry, merged_aliases)
)
orphan_pub = sorted(
k for k in pub_aliases if not would_resolve(k, registry, merged_aliases)
)
if orphan_alias:
print(f" Alias keys that fail resolution ({len(orphan_alias)}):")
for k in orphan_alias[:10]:
print(f" {k!r} -> {alias_tgt.get(k, '')!r}")
if len(orphan_alias) > 10:
print(f" ... and {len(orphan_alias) - 10} more")
else:
print(" All alias keys resolve to REGISTRY entries.")
if orphan_url:
print(f"\n URL keys that fail resolution ({len(orphan_url)}):")
for k in orphan_url[:10]:
print(f" {k[:60]!r} -> {url_tgt.get(k, '')!r}")
if len(orphan_url) > 10:
print(f" ... and {len(orphan_url) - 10} more")
if orphan_pub:
print(f"\n Publisher aliases that fail resolution ({len(orphan_pub)}):")
for k in orphan_pub[:10]:
print(f" {k!r} -> {pub_aliases[k]!r}")
if len(orphan_pub) > 10:
print(f" ... and {len(orphan_pub) - 10} more")
print(
"\n (Note: prose pattern version_keys are often bare name_keys like "
"'cc-by'; these resolve via the prose pipeline and are not orphans.)"
)
# --- 5. REGISTRY entries not covered by curated data ---
section("REGISTRY Entries Without Curated Mapping")
covered = (
set(alias_tgt.values()) | set(url_tgt.values()) | set(pub_aliases.values())
)
uncovered = sorted(k for k in registry if k not in covered)
if uncovered:
print(f" {len(uncovered)} REGISTRY keys have no curated alias/URL mapping:")
for k in uncovered[:20]:
print(f" {k}")
if len(uncovered) > 20:
print(f" ... and {len(uncovered) - 20} more")
else:
print(" All REGISTRY entries have at least one curated mapping.")
# --- 6. Duplicate alias keys (same key -> different targets) ---
section("Duplicate Keys in Alias / URL Data Files")
# Check if any key maps to different targets across aliases + url_map
# (keys are unique within each file, so cross-file check)
cross_keys = alias_keys & url_keys
if cross_keys:
print(f" Keys in both aliases.json AND url_map.json ({len(cross_keys)}):")
for k in sorted(cross_keys):
print(f" {k!r}: aliases={alias_tgt[k]!r}, url_map={url_tgt[k]!r}")
# --- 7. Alias target frequency (which targets have the most aliases) ---
section("Most-Aliased License Targets")
alias_counts = Counter(alias_tgt.values())
url_counts = Counter(url_tgt.values())
pub_counts = Counter(pub_aliases.values())
combined = alias_counts + url_counts + pub_counts
for target, count in combined.most_common(15):
parts = []
if alias_counts[target]:
parts.append(f"alias={alias_counts[target]}")
if url_counts[target]:
parts.append(f"url={url_counts[target]}")
if pub_counts[target]:
parts.append(f"pub={pub_counts[target]}")
print(f" {target:<30} total={count:<4} ({', '.join(parts)})")
# --- 8. Summary ---
section("Summary")
distinct = len(spdx | od | osi | cc | sc)
orphans = len(orphan_alias) + len(orphan_url) + len(orphan_pub)
print(f" Distinct license IDs: {distinct}")
print(f" Curated alias entries: {len(alias_keys)}")
print(f" Curated URL mappings: {len(url_keys)}")
print(f" Orphan curated targets: {orphans}")
print(f" OSI IDs missing SPDX: {len(osi_only)}")
covered_count = len(registry) - len(uncovered)
print(f" REGISTRY entries covered: {covered_count}/{len(registry)}")
if __name__ == "__main__":
main()
scripts/test_name_inference.py
"""Test name inference accuracy against curated aliases.
Compares heuristic name stripping against curated name_key values from
aliases.json to assess how well automatic name extraction works.
Usage:
uv run python scripts/test_name_inference.py
uv run python scripts/test_name_inference.py --json
uv run python scripts/test_name_inference.py --json --incorrect-only
uv run python scripts/test_name_inference.py --json --details
"""
from __future__ import annotations
import json
import sys
from pathlib import Path
from licence_normaliser import LicenseNormaliser
DATA_DIR = Path(__file__).parent.parent / "src" / "licence_normaliser" / "data"
SCRIPTS_DIR = Path(__file__).parent
_normaliser = LicenseNormaliser()
def load_name_mappings() -> dict[str, str]:
"""Load version_key -> name_key mappings from aliases.json."""
with open(DATA_DIR / "aliases" / "aliases.json") as f:
data = json.load(f)
mappings: dict[str, str] = {}
for meta in data.values():
if isinstance(meta, dict):
vk = meta.get("version_key", "")
nk = meta.get("name_key", "")
if vk and nk:
mappings[vk] = nk
return mappings
def infer_name_heuristic(version_key: str) -> str:
"""Delegate to the core LicenseNormaliser's _infer_name method."""
return _normaliser._infer_name(version_key)
def categorize_by_family(mappings: dict[str, str]) -> dict[str, dict[str, str]]:
"""Categorize licenses by inferred family."""
categories: dict[str, dict[str, str]] = {
"cc": {}, # Creative Commons
"copyleft": {}, # GPL/AGPL/LGPL
"osi": {}, # OSI-approved
"other": {},
}
for vk, nk in mappings.items():
if vk.startswith("cc-"):
categories["cc"][vk] = nk
elif vk.startswith(("gpl-", "agpl-", "lgpl-")):
categories["copyleft"][vk] = nk
elif vk.startswith(
("mpl-", "apache-", "bsd-", "mit", "isc", "unlicense", "zlib")
):
categories["osi"][vk] = nk
else:
categories["other"][vk] = nk
return categories
def assess_accuracy() -> dict:
"""Assess name inference accuracy."""
mappings = load_name_mappings()
categories = categorize_by_family(mappings)
results: dict = {
"total_mappings": len(mappings),
"by_family": {},
}
for family, family_mappings in categories.items():
correct = 0
incorrect = 0
details: list[dict] = []
for vk, curated_nk in family_mappings.items():
inferred = infer_name_heuristic(vk)
is_match = inferred == curated_nk
if is_match:
correct += 1
else:
incorrect += 1
details.append(
{
"version_key": vk,
"curated_name": curated_nk,
"inferred_name": inferred,
"match": is_match,
}
)
accuracy = (
round(correct / len(family_mappings) * 100, 1) if family_mappings else 0
)
results["by_family"][family] = {
"total": len(family_mappings),
"correct": correct,
"incorrect": incorrect,
"accuracy_percent": accuracy,
"details": details,
}
# Overall accuracy
all_correct = sum(r["correct"] for r in results["by_family"].values())
all_total = sum(r["total"] for r in results["by_family"].values())
results["overall_accuracy"] = (
round(all_correct / all_total * 100, 1) if all_total else 0
)
return results
def print_report(data: dict) -> None:
    """Print text table report."""
    print("=" * 70)
    print("Name Inference Accuracy Report")
    print("=" * 70)
    print()
    print(f"Total curated mappings: {data['total_mappings']}")
    print(f"Overall accuracy: {data['overall_accuracy']}%")
    print()
    print("-" * 70)
    print("By Family:")
    print("-" * 70)
    print(
        f"{'Family':<15} {'Total':>8} {'Correct':>8} {'Incorrect':>8} {'Accuracy':>10}"
    )
    print("-" * 70)
    for family, stats in data["by_family"].items():
        print(
            f"{family:<15} {stats['total']:>8} {stats['correct']:>8} "
            f"{stats['incorrect']:>8} {stats['accuracy_percent']:>9.1f}%"
        )
    print()
    # Show up to ten incorrect examples per family
    for family, stats in data["by_family"].items():
        if stats["incorrect"] > 0:
            # Filter before slicing: slicing first would only show the
            # mismatches that happen to fall within the first ten details.
            mismatches = [d for d in stats["details"] if not d["match"]]
            print("-" * 70)
            print(f"Incorrect in {family}: {stats['incorrect']} cases")
            print("-" * 70)
            print(
                f"{'Version Key':<30} {'Curated (aliases.json)':<25} "
                f"{'Inferred (heuristic)':<20}"
            )
            print("-" * 70)
            for detail in mismatches[:10]:
                print(
                    f"{detail['version_key']:<30} "
                    f"{detail['curated_name']:<25} {detail['inferred_name']:<20}"
                )
            if len(mismatches) > 10:
                print(f"... and {len(mismatches) - 10} more")
            print()
def main() -> None:
json_export = "--json" in sys.argv
incorrect_only = "--incorrect-only" in sys.argv
include_details = "--details" in sys.argv
data = assess_accuracy()
if json_export:
for family in data["by_family"]:
details = data["by_family"][family].get("details", [])
if incorrect_only:
data["by_family"][family]["details"] = [
d for d in details if not d["match"]
]
elif not include_details:
data["by_family"][family].pop("details", None)
print(json.dumps(data, indent=2))
else:
print_report(data)
if __name__ == "__main__":
main()
src/licence_normaliser/__init__.py
"""licence_normaliser - License normalisation with a three-level hierarchy."""
from ._core import (
LicenseFamily,
LicenseName,
LicenseVersion,
normalise_license,
normalise_licenses,
)
from ._normaliser import LicenseNormaliser
from ._trace import LicenseTrace, LicenseTraceStage
from .exceptions import LicenseNormalisationError, LicenseNotFoundError
__title__ = "licence-normaliser"
__version__ = "0.3.2"
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
"LicenseFamily",
"LicenseName",
"LicenseVersion",
"LicenseNormaliser",
"LicenseNormalisationError",
"LicenseNotFoundError",
"LicenseTrace",
"LicenseTraceStage",
"normalise_license",
"normalise_licenses",
)
src/licence_normaliser/_cache.py
"""Caching layer + strict mode - delegates to LicenseNormaliser with defaults."""
from __future__ import annotations
from threading import Lock
from typing import Iterable
from ._models import LicenseVersion
from ._normaliser import LicenseNormaliser
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
"_default",
"get_registry_keys",
"normalise_license",
"normalise_licenses",
)
class _DefaultNormaliser:
"""Thread-safe lazy singleton for the default LicenseNormaliser instance."""
_instance: LicenseNormaliser | None = None
_lock: Lock = Lock()
def get(self) -> LicenseNormaliser:
if _DefaultNormaliser._instance is None:
with _DefaultNormaliser._lock:
if _DefaultNormaliser._instance is None:
_DefaultNormaliser._instance = LicenseNormaliser()
return _DefaultNormaliser._instance
_default = _DefaultNormaliser()
def normalise_license(
raw: str, *, strict: bool = False, trace: bool | None = None
) -> LicenseVersion:
"""Public API with optional strict mode and trace."""
return _default.get().normalise_license(raw, strict=strict, trace=trace)
def normalise_licenses(
raws: Iterable[str], *, strict: bool = False, trace: bool | None = None
) -> list[LicenseVersion]:
"""Batch version with optional trace."""
return _default.get().normalise_licenses(raws, strict=strict, trace=trace)
def get_registry_keys() -> set[str]:
"""Return the set of all known registry keys from the runtime normaliser."""
return _default.get().registry_keys()
src/licence_normaliser/_core.py
"""License Normaliser - public orchestration shim."""
from __future__ import annotations
from ._cache import normalise_license, normalise_licenses
from ._models import LicenseFamily, LicenseName, LicenseVersion
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
"LicenseFamily",
"LicenseName",
"LicenseVersion",
"normalise_license",
"normalise_licenses",
)
src/licence_normaliser/_models.py
"""License data models - frozen dataclasses for the three-level hierarchy."""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
"LicenseFamily",
"LicenseName",
"LicenseVersion",
)
@dataclass(frozen=True, slots=True)
class LicenseFamily:
key: str
def __str__(self) -> str:
return self.key
def __repr__(self) -> str:
return f"LicenseFamily({self.key!r})"
def __eq__(self, other: object) -> bool:
if isinstance(other, LicenseFamily):
return self.key == other.key
if isinstance(other, str):
return self.key == other
return NotImplemented
def __hash__(self) -> int:
return hash(self.key)
@dataclass(frozen=True, slots=True)
class LicenseName:
key: str
family: LicenseFamily
def __str__(self) -> str:
return self.key
def __repr__(self) -> str:
return f"LicenseName({self.key!r}, family={self.family.key!r})"
def __eq__(self, other: object) -> bool:
if isinstance(other, LicenseName):
return self.key == other.key
if isinstance(other, str):
return self.key == other
return NotImplemented
def __hash__(self) -> int:
return hash(self.key)
@dataclass(frozen=True, slots=True)
class LicenseVersion:
key: str
url: Optional[str]
license: LicenseName
_trace: Optional[object] = field(default=None, repr=False)
@property
def family(self) -> LicenseFamily:
return self.license.family
def __str__(self) -> str:
return self.key
def __repr__(self) -> str:
return (
f"LicenseVersion(key={self.key!r}, "
f"license={self.license.key!r}, "
f"family={self.license.family.key!r})"
)
def __eq__(self, other: object) -> bool:
if isinstance(other, LicenseVersion):
return self.key == other.key
if isinstance(other, str):
return self.key == other
return NotImplemented
def __hash__(self) -> int:
return hash(self.key)
def explain(self) -> str:
"""Return explanation of how this license was resolved.
Set ENABLE_LICENCE_NORMALISER_TRACE=1 to enable tracing,
or pass trace=True to normalise_license().
"""
if self._trace is not None:
return str(self._trace)
from licence_normaliser._cache import _default
from licence_normaliser._trace import _should_trace
if not _should_trace():
return "Trace disabled. Set ENABLE_LICENCE_NORMALISER_TRACE=1 to enable."
ln = _default.get()
cleaned = ln._clean(ln._try_decode_mojibake(self.key))
result = ln._resolve_with_trace(self.key, cleaned, strict=False)
trace = result._trace
return str(trace) if trace else "No trace available."
src/licence_normaliser/_normaliser.py
"""Plugin-based LicenseNormaliser class with configurable constructor injection."""
from __future__ import annotations
import re
from functools import lru_cache
from typing import TYPE_CHECKING, Iterable, Sequence
from licence_normaliser.defaults import (
get_default_alias,
get_default_family,
get_default_name,
get_default_prose,
get_default_registry,
get_default_url,
)
if TYPE_CHECKING:
from licence_normaliser._models import LicenseVersion
from licence_normaliser._trace import LicenseTrace
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("LicenseNormaliser",)
_WHITESPACE_RE = re.compile(r"\s+")
_MAX_INPUT = 4096
class LicenseNormaliser:
"""Configurable license normalisation with plugin-based data sources.
Plugins are passed as CLASSES (not instances). Each class is instantiated in
the constructor, where its load_* method is called to populate the lookup
tables.
Six plugin types are supported (each returns a specific data structure):
- registry: key -> canonical_key
- url: cleaned_url -> version_key
- alias: alias_string -> version_key
- family: version_key -> family_key
- name: version_key -> name_key
- prose: list of (compiled_pattern, version_key)
Resolution order: aliases -> registry -> url -> prose -> unknown
Name/family inference: plugins only - no fallback to hardcoded logic.
Tracing
Set ``trace=True`` to include resolution trace in the result. Trace shows
which pipeline stage matched and the source file/line number (when
available). Trace is disabled by default for performance.
Trace can be enabled at three levels (precedence: method >
constructor > env var):
- **Constructor**: ``LicenseNormaliser(trace=True)`` - all calls get trace
- **Method**: ``ln.normalise_license("MIT", trace=True)`` - this call only
- **Environment**: ``ENABLE_LICENCE_NORMALISER_TRACE=1`` - applies globally
Example::
from licence_normaliser import LicenseNormaliser
# Uses all defaults automatically
ln = LicenseNormaliser()
# Disable caching for debugging
ln = LicenseNormaliser(cache=False)
# Enable trace for all calls on this instance
ln = LicenseNormaliser(trace=True)
v = ln.normalise_license("MIT")
print(v.explain()) # Shows resolution path with source lines
# Or enable trace for a single call
v = ln.normalise_license("MIT", trace=True)
"""
def __init__(
self,
*,
registry: Sequence[type] | None = None,
url: Sequence[type] | None = None,
alias: Sequence[type] | None = None,
family: Sequence[type] | None = None,
name: Sequence[type] | None = None,
prose: Sequence[type] | None = None,
cache: bool = True,
cache_maxsize: int = 8192,
trace: bool | None = None,
) -> None:
self._registry: dict[str, str] = {}
self._url_map: dict[str, str] = {}
self._url_to_vkey: dict[str, str] = {}
self._aliases: dict[str, str] = {}
self._alias_lines: dict[str, tuple[str, int]] = {}
self._publisher_alias_lines: dict[str, tuple[str, int]] = {}
self._publisher_url_lines: dict[str, tuple[str, int]] = {}
self._prose_lines: list[tuple[re.Pattern[str], str, int]] = []
self._alias_lines_loaded: bool = False
self._family_overrides: dict[str, str] = {}
self._name_overrides: dict[str, str] = {}
self._prose_patterns: list[tuple[re.Pattern[str], str]] = []
self._cache = cache
self._cache_maxsize = cache_maxsize
self._trace_default = trace
# Load plugins - use defaults if not explicitly provided
registry = registry or get_default_registry()
url = url or get_default_url()
alias = alias or get_default_alias()
family = family or get_default_family()
name = name or get_default_name()
prose = prose or get_default_prose()
# Store plugin lists for trace resolution
self._alias_plugins = alias
self._url_plugins = url
self._prose_plugins = prose
# Instantiate plugins and load their data
for plugin_cls in registry:
data = plugin_cls().load_registry()
self._registry.update(data)
for plugin_cls in url:
data = plugin_cls().load_urls()
self._url_map.update(data)
# Build inverted URL map: version_key -> cleaned_url (for LicenseVersion.url).
# Note: despite its name, _url_to_vkey is keyed by version_key, not by URL.
self._url_to_vkey = {v: k for k, v in self._url_map.items()}
for plugin_cls in alias:
data = plugin_cls().load_aliases()
self._aliases.update(data)
for plugin_cls in family:
data = plugin_cls().load_families()
self._family_overrides.update(data)
for plugin_cls in name:
data = plugin_cls().load_names()
self._name_overrides.update(data)
for plugin_cls in prose:
patterns = plugin_cls().load_prose()
self._prose_patterns.extend(patterns)
# Set up cached resolution
if self._cache:
resolve_fn = lru_cache(maxsize=self._cache_maxsize)(self._resolve_impl)
# type: ignore[assignment]
self._resolve_impl = resolve_fn

    def _get_trace_mode(self, trace: bool | None) -> bool:
        """Determine if tracing is enabled: explicit > env var > default."""
        from licence_normaliser._trace import _should_trace

        if trace is not None:
            return trace
        if self._trace_default is not None:
            return self._trace_default
        return _should_trace()

    def _load_alias_lines(self) -> None:
        """Lazy load all source line numbers on first trace request."""
        # Alias plugins: keep the line of each alias whose target matches
        # the merged alias map.
        for plugin_cls in self._alias_plugins:
            if hasattr(plugin_cls, "load_aliases_with_lines"):
                lines_data = plugin_cls().load_aliases_with_lines()
                for alias_key, (version_key, line_num) in lines_data.items():
                    if version_key == self._aliases.get(alias_key):
                        self._alias_lines[alias_key] = (version_key, line_num)
        # URL plugins that also expose alias line numbers.
        for plugin_cls in self._url_plugins:
            if hasattr(plugin_cls, "load_aliases_with_lines"):
                lines_data = plugin_cls().load_aliases_with_lines()
                for alias_key, (version_key, line_num) in lines_data.items():
                    if version_key == self._aliases.get(alias_key):
                        self._publisher_alias_lines[alias_key] = (version_key, line_num)
        # URL plugins: keep the line of each URL whose target matches the URL map.
        for plugin_cls in self._url_plugins:
            if hasattr(plugin_cls, "load_urls_with_lines"):
                lines_data = plugin_cls().load_urls_with_lines()
                for url_key, (version_key, line_num) in lines_data.items():
                    if version_key == self._url_map.get(url_key):
                        self._publisher_url_lines[url_key] = (version_key, line_num)
        # Prose plugins: (pattern, version_key, line) tuples in pattern order.
        for plugin_cls in self._prose_plugins:
            if hasattr(plugin_cls, "load_prose_with_lines"):
                lines_data = plugin_cls().load_prose_with_lines()
                self._prose_lines.extend(lines_data)

    def _resolve_with_trace(
        self, raw: str, cleaned: str, strict: bool
    ) -> LicenseVersion:
        """Resolve with full pipeline tracing."""
        from licence_normaliser._trace import LicenseTrace, LicenseTraceStage

        # Lazy load alias lines on first trace call
        if not self._alias_lines_loaded:
            self._load_alias_lines()
            self._alias_lines_loaded = True

        stages: list[LicenseTraceStage] = []

        # 1. Alias lookup
        if cleaned in self._aliases:
            output = self._aliases[cleaned]
            source_line = None
            source_file = None
            if cleaned in self._alias_lines:
                _, source_line = self._alias_lines[cleaned]
                source_file = "aliases.json"
            stages.append(
                LicenseTraceStage(
                    "alias", cleaned, output, True, source_line, source_file
                )
            )
            v = self._make(output)
            trace = LicenseTrace(
                raw,
                cleaned,
                stages,
                version_key=v.key,
                name_key=v.license.key,
                family_key=v.family.key,
            )
            return self._make_with_trace(v, trace)
        stages.append(LicenseTraceStage("alias", cleaned, "", False))

        # 2. Registry lookup
        if cleaned in self._registry:
            canonical = self._registry[cleaned]
            stages.append(LicenseTraceStage("registry", cleaned, canonical, True))
            v = self._make(canonical)
            trace = LicenseTrace(
                raw,
                cleaned,
                stages,
                version_key=v.key,
                name_key=v.license.key,
                family_key=v.family.key,
            )
            return self._make_with_trace(v, trace)
        stages.append(LicenseTraceStage("registry", cleaned, "", False))

        # 3. URL lookup
        url_key = self._normalise_url(cleaned)
        if url_key in self._url_map:
            resolved = self._url_map[url_key]
            source_line = None
            source_file = None
            if url_key in self._publisher_url_lines:
                _, source_line = self._publisher_url_lines[url_key]
                source_file = "publishers.json"
            stages.append(
                LicenseTraceStage(
                    "url", url_key, resolved, True, source_line, source_file
                )
            )
            v = self._make(resolved)
            trace = LicenseTrace(
                raw,
                cleaned,
                stages,
                version_key=v.key,
                name_key=v.license.key,
                family_key=v.family.key,
            )
            return self._make_with_trace(v, trace)
        stages.append(LicenseTraceStage("url", cleaned, "", False))

        # 4. Prose matching (only for longer strings)
        if len(cleaned) >= 20:
            for i, (pattern, vkey) in enumerate(self._prose_patterns):
                if pattern.search(cleaned):
                    source_line = None
                    source_file = "prose_patterns.json"
                    if self._prose_lines and i < len(self._prose_lines):
                        _, _, source_line = self._prose_lines[i]
                    stages.append(
                        LicenseTraceStage(
                            "prose", cleaned, vkey, True, source_line, source_file
                        )
                    )
                    v = self._make(vkey)
                    trace = LicenseTrace(
                        raw,
                        cleaned,
                        stages,
                        version_key=v.key,
                        name_key=v.license.key,
                        family_key=v.family.key,
                    )
                    return self._make_with_trace(v, trace)
        stages.append(LicenseTraceStage("prose", cleaned, "", False))

        # 5. Fallback to unknown
        stages.append(LicenseTraceStage("fallback", cleaned, cleaned, True))
        v = self._make_unknown(cleaned)
        trace = LicenseTrace(
            raw,
            cleaned,
            stages,
            version_key=v.key,
            name_key=v.license.key,
            family_key=v.family.key,
        )
        return self._make_with_trace(v, trace)

    def _make_with_trace(
        self, v: LicenseVersion, trace: LicenseTrace
    ) -> LicenseVersion:
        """Attach ``trace`` to ``v`` in place and return the same object."""
        # LicenseVersion is a frozen dataclass, so bypass its __setattr__
        object.__setattr__(v, "_trace", trace)
        return v

    def _resolve_impl(self, cleaned: str) -> LicenseVersion:
        # 1. Alias lookup
        if cleaned in self._aliases:
            return self._make(self._aliases[cleaned])
        # 2. Registry lookup
        if cleaned in self._registry:
            canonical = self._registry[cleaned]
            return self._make(canonical)
        # 3. URL lookup
        url_key = self._normalise_url(cleaned)
        if url_key in self._url_map:
            return self._make(self._url_map[url_key])
        # 4. Prose matching (only for longer strings)
        if len(cleaned) >= 20:
            for pattern, vkey in self._prose_patterns:
                if pattern.search(cleaned):
                    return self._make(vkey)
        # 5. Fallback to unknown
        return self._make_unknown(cleaned)

    def normalise_license(
        self, raw: str, *, strict: bool = False, trace: bool | None = None
    ) -> LicenseVersion:
        """Normalise a single license string.

        Args:
            raw: The raw license string, SPDX ID, URL, or prose description.
            strict: If True, raises ``LicenseNotFoundError`` when the input
                cannot be resolved to a known license.
            trace: If True, include resolution trace showing which pipeline
                stage matched and source file/line. If None, uses the instance
                default (``trace`` param from constructor) or falls back to
                ``ENABLE_LICENCE_NORMALISER_TRACE`` env var.

        Returns:
            A ``LicenseVersion`` with the resolved key, license name, and family.

        Raises:
            LicenseNotFoundError: When ``strict=True`` and resolution fails.
        """
        from licence_normaliser.exceptions import LicenseNotFoundError

        do_trace = self._get_trace_mode(trace)
        if not raw or not raw.strip():
            cleaned = "unknown"
            v = self._make_unknown(cleaned)
            if do_trace:
                from licence_normaliser._trace import LicenseTrace, LicenseTraceStage

                stages = [LicenseTraceStage("fallback", cleaned, cleaned, True)]
                trace_obj = LicenseTrace(
                    raw,
                    cleaned,
                    stages,
                    version_key=v.key,
                    name_key=v.license.key,
                    family_key=v.family.key,
                )
                v = self._make_with_trace(v, trace_obj)
        else:
            cleaned = self._clean(self._try_decode_mojibake(raw))
            if do_trace:
                v = self._resolve_with_trace(raw, cleaned, strict)
            else:
                v = self._resolve_impl(cleaned)
        if strict and v.family.key == "unknown":
            raise LicenseNotFoundError(raw, v.key) from None
        return v

    def normalise_licenses(
        self, raws: Iterable[str], *, strict: bool = False, trace: bool | None = None
    ) -> list[LicenseVersion]:
        """Batch normalisation.

        When ``strict=True``, raises on the first failure.
        """
        from licence_normaliser.exceptions import LicenseNotFoundError

        results: list[LicenseVersion] = []
        for raw in raws:
            v = self.normalise_license(raw, strict=False, trace=trace)
            if strict and v.family.key == "unknown":
                raise LicenseNotFoundError(raw, v.key) from None
            results.append(v)
        return results

    def registry_keys(self) -> set[str]:
        """Return the set of all known registry keys."""
        return set(self._registry.keys())

    def _make(self, key: str) -> LicenseVersion:
        """Factory: build a LicenseVersion from a resolved version_key."""
        from licence_normaliser._models import (
            LicenseFamily,
            LicenseName,
            LicenseVersion,
        )

        k = key.lower().strip()
        # Get canonical key from registry
        canonical = self._registry.get(k) or k
        # Get URL via inverted map: version_key -> cleaned_url
        url = self._url_to_vkey.get(canonical) or self._url_to_vkey.get(k)
        # Infer name:
        # - CC licenses: use the override if present, else fall back to _infer_name
        # - Non-CC (GPL, AGPL, OSI, etc.): use the override if present and
        #   different from the canonical key, else keep the canonical key
        override_name = self._name_overrides.get(canonical)
        if canonical.startswith(("cc-", "cc0")):
            name_key = override_name if override_name else self._infer_name(canonical)
        else:
            name_key = (
                override_name
                if override_name and override_name != canonical
                else canonical
            )
        # Infer family: use override only if it provides a different value
        override_family = self._family_overrides.get(canonical)
        family_key = (
            override_family
            if override_family and override_family != canonical
            else self._infer_family(canonical)
        )
        family = LicenseFamily(key=family_key)
        name = LicenseName(key=name_key, family=family)
        return LicenseVersion(key=canonical, url=url, license=name)

    def _make_unknown(self, key: str) -> LicenseVersion:
        """Factory: build an unknown LicenseVersion for unresolved input."""
        from licence_normaliser._models import (
            LicenseFamily,
            LicenseName,
            LicenseVersion,
        )

        family = LicenseFamily(key="unknown")
        name = LicenseName(key=key, family=family)
        return LicenseVersion(key=key, url=None, license=name)

    def _infer_family(self, key: str) -> str:
        """Fallback family inference - only used if no plugin provides it."""
        k = key.lower()
        if k.startswith("cc0"):
            return "cc0"
        if k.startswith("cc-pdm"):
            return "public-domain"
        if k.startswith("cc-"):
            return "cc"
        if k.startswith(("gpl-", "agpl-", "lgpl-")):
            return "copyleft"
        if k.startswith(("odbl", "odc-by")):
            return "open-data"
        if k.startswith(("pddl-", "odc-")):
            return "data"
        if k.startswith(
            (
                "elsevier-oa",
                "acs-authorchoice",
                "acs-authorchoice-ccby",
                "acs-authorchoice-ccbyncnd",
                "acs-authorchoice-nih",
                "jama-cc-by",
                "thieme-nlm",
                "implied-oa",
                "unspecified-oa",
                "publisher-specific-oa",
                "author-manuscript",
                "oup-chorus",
            )
        ):
            return "publisher-oa"
        if k.startswith(
            (
                "elsevier-tdm",
                "wiley-tdm",
                "springer-tdm",
                "springernature-tdm",
                "iop-tdm",
                "aps-tdm",
            )
        ):
            return "publisher-tdm"
        if k.startswith(
            (
                "elsevier-",
                "wiley-",
                "springer-",
                "springernature-",
                "acs-",
                "rsc-",
                "iop-",
                "bmj-",
                "aaas-",
                "pnas-",
                "aps-",
                "cup-",
                "aip-",
                "jama-",
                "degruyter-",
                "oup-",
                "sage-",
                "tandf-",
                "thieme-",
            )
        ):
            return "publisher-proprietary"
        if k in ("public-domain", "other-oa", "open-access"):
            return "public-domain" if k == "public-domain" else "other-oa"
        return "unknown"

    def _infer_name(self, key: str) -> str:
        """Fallback name inference - only used if no plugin provides it."""
        k = key.lower()
        if k.startswith("cc0"):
            return "cc0"
        if k.startswith("cc-"):
            parts = k.split("-")
            for i, part in enumerate(parts):
                if part.replace(".", "").isdigit():
                    return "-".join(parts[:i])
            return "-".join(parts[:2])
        # For all other licenses (GPL, AGPL, OSI, etc.), keep the key as-is
        return k

    @staticmethod
    def _clean(raw: str) -> str:
        s = _WHITESPACE_RE.sub(" ", raw.strip().rstrip("/")).lower()
        return s[:_MAX_INPUT]

    @staticmethod
    def _try_decode_mojibake(s: str) -> str:
        try:
            return s.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return s

    @staticmethod
    def _normalise_url(cleaned: str) -> str:
        key = cleaned.lower()
        if key.startswith("http://"):
            key = "https://" + key[7:]
        return key.rstrip("/")
src/licence_normaliser/_trace.py
"""License trace and explanation support."""

from __future__ import annotations

import os
from dataclasses import dataclass, field

__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
    "TRACE_STAGES",
    "LicenseTrace",
    "LicenseTraceStage",
)

TRACE_STAGES = ("alias", "registry", "url", "prose", "fallback")


@dataclass
class LicenseTraceStage:
    """Single stage in the license resolution pipeline."""

    stage: str
    input: str
    output: str
    matched: bool
    source_line: int | None = None
    source_file: str | None = None


@dataclass
class LicenseTrace:
    """Complete trace of license resolution pipeline."""

    raw_input: str
    cleaned_input: str
    stages: list[LicenseTraceStage] = field(default_factory=list)
    version_key: str = ""
    name_key: str = ""
    family_key: str = ""

    def __str__(self) -> str:
        lines = [f"Input: {self.raw_input!r} → {self.cleaned_input!r}"]
        for s in self.stages:
            status = "✓" if s.matched else "-"
            source_info = ""
            if s.source_line is not None:
                source_info = f" (line {s.source_line}"
                if s.source_file:
                    source_info += f" in {s.source_file}"
                source_info += ")"
            lines.append(
                f" [{status}] {s.stage}: {s.input!r} → {s.output!r}{source_info}"
            )
        lines.append("")
        lines.append("Result:")
        lines.append(f" version_key: {self.version_key!r}")
        lines.append(f" name_key: {self.name_key!r}")
        lines.append(f" family_key: {self.family_key!r}")
        return "\n".join(lines)


def _should_trace() -> bool:
    """Check if tracing is enabled via environment variable."""
    return os.environ.get("ENABLE_LICENCE_NORMALISER_TRACE", "").lower() in (
        "1",
        "true",
        "yes",
    )
src/licence_normaliser/cli/__init__.py
"""licence_normaliser.cli - command-line interface for licence-normaliser."""
from ._main import main
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("main",)
src/licence_normaliser/cli/_main.py
"""licence-normaliser CLI - license normalisation from the command line."""
import argparse
import sys
from pathlib import Path
from licence_normaliser import __version__, normalise_license
from licence_normaliser._trace import _should_trace
from licence_normaliser.defaults import get_all_refreshable_plugins
from licence_normaliser.exceptions import (
LicenseNormalisationError,
LicenseNotFoundError,
)
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("main",)


def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="licence-normaliser",
        description="Comprehensive license normalisation - three-level hierarchy.",
    )
    parser.add_argument(
        "--version",
        action="version",
        version=f"%(prog)s {__version__}",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    norm = sub.add_parser("normalise", help="Normalise a license string.")
    norm.add_argument("license", help="License string to normalise.")
    norm.add_argument("--full", action="store_true")
    norm.add_argument("--strict", action="store_true")
    norm.add_argument("--trace", action="store_true", help="Show resolution trace.")

    batch = sub.add_parser("batch", help="Normalise multiple license strings.")
    batch.add_argument("licenses", nargs="+")
    batch.add_argument("--strict", action="store_true")
    batch.add_argument(
        "--trace", action="store_true", help="Show resolution trace for each."
    )

    update = sub.add_parser(
        "update-data", help="Fetch fresh data from all registered parsers."
    )
    update.add_argument(
        "--parser",
        dest="parser_name",
        metavar="NAME",
        help="Refresh only the named parser (e.g. spdx, opendefinition, osi). "
        "Without this flag, all parsers are refreshed.",
    )
    update.add_argument(
        "--force",
        action="store_true",
        help="Overwrite even if the local file already exists.",
    )
    return parser


def _cmd_normalise(args: argparse.Namespace) -> int:
    try:
        trace = args.trace or _should_trace()
        result = normalise_license(args.license, strict=args.strict, trace=trace)
        if trace:
            print(result.explain())
        elif args.full:
            print(f"Key: {result.key}")
            print(f"URL: {result.url or '(none)'}")
            print(f"License: {result.license}")
            print(f"Family: {result.family}")
        else:
            print(result.key)
    except LicenseNotFoundError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    except LicenseNormalisationError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0


def _cmd_batch(args: argparse.Namespace) -> int:
    trace = args.trace or _should_trace()
    # LicenseNotFoundError is only raised when strict=True, so a single loop
    # covers both the strict and the lenient mode.
    try:
        for license_str in args.licenses:
            result = normalise_license(license_str, strict=args.strict, trace=trace)
            if trace:
                print(f"{license_str}:")
                print(result.explain())
            else:
                print(f"{license_str}: {result.key}")
    except LicenseNotFoundError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0


def _cmd_update_data(args: argparse.Namespace) -> int:
    parser_classes = get_all_refreshable_plugins()
    if args.parser_name:
        parser_classes = [
            p for p in parser_classes if getattr(p, "id", None) == args.parser_name
        ]
        if not parser_classes:
            available = [
                getattr(p, "id", p.__name__) for p in get_all_refreshable_plugins()
            ]
            print(
                f"error: unknown parser {args.parser_name!r}. Available: {available}",
                file=sys.stderr,
            )
            return 1
    failed: list[str] = []
    for parser_cls in parser_classes:
        name = getattr(parser_cls, "id", parser_cls.__name__)
        url = parser_cls.url
        target = parser_cls.local_path
        target_path = Path(__file__).parent.parent / target
        # Check for the file before refreshing; afterwards a freshly fetched
        # file would also exist and be misreported as "skipped".
        existed_before = target_path.exists()
        ok = parser_cls.refresh(args.force)
        if existed_before and not args.force:
            status = "skipped"
        elif ok:
            status = "fetched"
        else:
            status = "FAILED"
        if not ok:
            failed.append(name)
        print(f" {status}: {name} ({url}) -> {target}")
    if failed:
        print(f"error: failed to refresh: {', '.join(failed)}", file=sys.stderr)
        return 1
    print("Data sources updated successfully.")
    return 0


def main() -> None:
    parser = _build_parser()
    args = parser.parse_args()
    if args.command == "normalise":
        sys.exit(_cmd_normalise(args))
    elif args.command == "batch":
        sys.exit(_cmd_batch(args))
    elif args.command == "update-data":
        sys.exit(_cmd_update_data(args))
    else:
        parser.print_help()
        sys.exit(1)
src/licence_normaliser/data/README.rst

Data Directory
==============

This directory contains all normalisation data files loaded at runtime
by ``licence-normaliser``. You can extend or override entries without
touching any Python code.

Structure
---------

::

    data/
    ├── aliases/
    │   └── aliases.json                # Alias string → metadata dict
    ├── urls/
    │   └── url_map.json                # Canonical URL → metadata dict
    ├── prose/
    │   └── prose_patterns.json         # Ordered regex patterns for long text scanning
    ├── publishers/
    │   └── publishers.json             # Publisher URLs and shorthand aliases
    ├── spdx/
    │   └── spdx.json                   # SPDX license list (auto-refreshed)
    ├── opendefinition/
    │   └── opendefinition.json         # Open Definition list (auto-refreshed)
    ├── osi/
    │   └── osi.json                    # OSI license list (auto-refreshed)
    ├── creativecommons/
    │   └── creativecommons.json        # CC licenses (scraped from creativecommons.org)
    └── scancode_licensedb/
        └── scancode_licensedb.json     # ScanCode license DB (auto-refreshed)

Entry Format
------------

Every entry maps a **lookup key** (alias string, URL, or prose pattern)
to a metadata dict with three required fields:

- ``version_key`` – the canonical version-level identifier
  (e.g. ``"cc-by-4.0"``)
- ``name_key`` – the name-level identifier without version suffix
  (e.g. ``"cc-by"``)
- ``family_key`` – the family-level identifier (e.g. ``"cc"``)

URLs are stored separately in the ``url`` field of the metadata dict.
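
A quick way to catch incomplete entries before committing a data file is to
check for the three required fields. The sketch below is illustrative only:
``REQUIRED_FIELDS`` and ``missing_fields`` are hypothetical names, not part of
the package, and the check assumes every field must be a non-empty value.

```python
# Hypothetical helper: report which required fields an entry lacks.
REQUIRED_FIELDS = ("version_key", "name_key", "family_key")


def missing_fields(entry: dict) -> list:
    """Return the required field names that are absent or empty in ``entry``."""
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]


complete = {"version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"}
partial = {"version_key": "cc-by-4.0"}

print(missing_fields(complete))
print(missing_fields(partial))
```

Optional keys seen in ``aliases.json``, such as ``aliases`` or
``justification``, are ignored by this check.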

How to Add a New License Alias
------------------------------

Edit ``aliases/aliases.json``:

.. code:: json

    {
      "my new alias": {
        "version_key": "cc-by-4.0",
        "name_key": "cc-by",
        "family_key": "cc"
      }
    }

The key must be **lowercase and whitespace-collapsed**.
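
To see what "lowercase and whitespace-collapsed" means in practice, the sketch
below mirrors the cleaning step the normaliser applies to raw input (strip,
drop trailing slashes, collapse whitespace, lowercase). The helper name
``clean_alias_key`` is hypothetical, the whitespace regex is an assumption, and
the length cap the library applies is omitted because its value is not shown
here.

```python
import re

# Assumed pattern: any run of whitespace collapses to a single space.
_WHITESPACE_RE = re.compile(r"\s+")


def clean_alias_key(raw: str) -> str:
    """Return ``raw`` in the cleaned form expected for alias keys."""
    stripped = raw.strip().rstrip("/")
    return _WHITESPACE_RE.sub(" ", stripped).lower()


print(clean_alias_key("  My   New\tAlias "))
```

An alias key that is not already in this form will never match, because
lookups are performed on the cleaned input.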

How to Add a Publisher URL or Shorthand
---------------------------------------

Edit ``publishers/publishers.json``:

.. code:: json

    {
      "urls": {
        "https://example.com/my-license/": {
          "version_key": "my-license",
          "name_key": "my-license",
          "family_key": "publisher-oa"
        }
      },
      "shorthand_aliases": {
        "my shorthand alias": "my-license"
      }
    }

Both ``http://`` and ``https://`` URL variants may be listed; they are
normalised at lookup time (http→https, trailing slash stripped).
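
The lookup-time normalisation can be sketched as follows. This mirrors the
``_normalise_url`` static method of the normaliser class; the standalone
function name ``normalise_url`` is used here purely for illustration.

```python
def normalise_url(cleaned: str) -> str:
    """Lowercase the URL, upgrade http to https, and strip trailing slashes."""
    key = cleaned.lower()
    if key.startswith("http://"):
        key = "https://" + key[len("http://"):]
    return key.rstrip("/")


# Both variants collapse to the same lookup key.
print(normalise_url("http://example.com/my-license/"))
print(normalise_url("https://example.com/my-license"))
```

Because both spellings map to one key, a single entry is enough to cover the
``http://`` and ``https://`` forms of the same address.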

How to Add a New URL Mapping
----------------------------

Edit ``urls/url_map.json``:

.. code:: json

    {
      "https://example.com/my-license/": {
        "version_key": "my-license",
        "name_key": "my-license",
        "family_key": "publisher-oa"
      }
    }

How to Add a New Prose Pattern
------------------------------

Edit ``prose/prose_patterns.json`` and insert your entry **before** any
pattern it should take priority over:

.. code:: json

    [
      {"pattern": "my very specific phrase",
       "version_key": "my-license",
       "name_key": "my-license",
       "family_key": "publisher-oa"},
      ...
    ]

Patterns are Python regular expressions matched case-insensitively.
More-specific patterns must come first.
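
The ordering rule matters because the first matching pattern wins. The sketch
below illustrates this with two made-up patterns; ``ENTRIES`` and
``first_match`` are hypothetical names, and the patterns are examples rather
than the contents of the shipped ``prose_patterns.json``.

```python
import re
from typing import Optional

# Illustrative entries only: the specific pattern precedes the generic one.
ENTRIES = [
    {"pattern": r"creative commons attribution[- ]noncommercial",
     "version_key": "cc-by-nc"},
    {"pattern": r"creative commons attribution",
     "version_key": "cc-by"},
]

# Compile once, case-insensitively, preserving file order.
COMPILED = [(re.compile(e["pattern"], re.IGNORECASE), e["version_key"])
            for e in ENTRIES]


def first_match(text: str) -> Optional[str]:
    """Return the version_key of the first pattern found in ``text``."""
    for pattern, version_key in COMPILED:
        if pattern.search(text):
            return version_key
    return None


print(first_match("Licensed under a Creative Commons Attribution-NonCommercial license"))
print(first_match("Released under the Creative Commons Attribution licence"))
```

If the two entries were swapped, the generic pattern would also match the
first sentence and shadow the specific one. Note that the resolver only
attempts prose matching on cleaned input of at least 20 characters.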

How to Add a Brand-New License
------------------------------

1. Add entries to one or more JSON data files (``aliases/aliases.json``,
   ``urls/url_map.json``, ``prose/prose_patterns.json``, or
   ``publishers/publishers.json``). Each entry maps a key to a dict with
   ``version_key``, ``name_key``, and ``family_key``.
2. If the family cannot be derived from the ``version_key`` by the
   prefix-based fallback in the normaliser's ``_infer_family`` method,
   set an explicit ``family_key`` value in the JSON entry (recommended).
3. Run ``make test-env ENV=py312`` to verify.
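
To judge whether a new ``version_key`` is covered by the fallback, it helps to
picture the inference as an ordered prefix table. The sketch below reproduces
only the first few rules of ``_infer_family``; ``FAMILY_PREFIXES`` and
``infer_family`` are hypothetical names, and the real method contains many
more publisher-specific prefixes.

```python
# Ordered (prefixes, family) pairs; earlier rows take priority.
FAMILY_PREFIXES = [
    (("cc0",), "cc0"),
    (("cc-pdm",), "public-domain"),
    (("cc-",), "cc"),
    (("gpl-", "agpl-", "lgpl-"), "copyleft"),
]


def infer_family(version_key: str) -> str:
    """Return the family for ``version_key``, or "unknown" if no prefix fits."""
    key = version_key.lower()
    for prefixes, family in FAMILY_PREFIXES:
        if key.startswith(prefixes):
            return family
    return "unknown"


print(infer_family("cc-by-4.0"))
print(infer_family("my-license"))
```

A key such as ``my-license`` falls through to ``"unknown"``, which is why a
brand-new license should normally carry an explicit ``family_key``.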

Updating SPDX or OpenDefinition
-------------------------------

The ``licence-normaliser update-data`` CLI command fetches fresh upstream data:

.. code:: sh

    licence-normaliser update-data --force

Without ``--force``, files that already exist locally are skipped. Pass
``--parser NAME`` (for example ``--parser spdx``) to refresh a single source.

This updates:

- ``spdx/spdx.json`` – full `SPDX license list <https://spdx.org/licenses/>`_
- ``opendefinition/opendefinition.json`` – full `Open Definition list <https://opendefinition.org/>`_
- ``osi/osi.json`` – `OSI license list <https://opensource.org/licenses>`_
- ``creativecommons/creativecommons.json`` – scraped from creativecommons.org
- ``scancode_licensedb/scancode_licensedb.json`` – `ScanCode license DB <https://scancode-licensedb.aboutcode.org/>`_

Family Override Files
---------------------

Some entries carry an explicit ``family_key`` that overrides the
prefix-based inference in the normaliser's ``_infer_family`` method.
These are stored in:

- ``aliases/aliases.json`` – ``family_key`` on any alias entry populates
  the normaliser's family overrides when its plugins are loaded.
src/licence_normaliser/data/aliases/aliases.json
{
"_comment": "Curated alias map: cleaned-lowercase-string -> metadata dict.",
"_comment2": "Keys must already be in cleaned form (lowercase, whitespace-collapsed).",
"aaas reuse": {
"version_key": "aaas-author-reuse",
"name_key": "aaas-author-reuse",
"family_key": "publisher-proprietary",
"aliases": [
"aaas author reuse",
"aaas-author-reuse",
"science author reuse"
]
},
"acs authorchoice": {
"version_key": "acs-authorchoice",
"name_key": "acs-authorchoice",
"family_key": "publisher-oa",
"aliases": [
"acs-authorchoice"
]
},
"acs-authorchoice-ccby": {
"version_key": "acs-authorchoice-ccby",
"name_key": "acs-authorchoice-ccby",
"family_key": "publisher-oa",
"aliases": [
"acs authorchoice cc by"
]
},
"acs-authorchoice-ccbyncnd": {
"version_key": "acs-authorchoice-ccbyncnd",
"name_key": "acs-authorchoice-ccbyncnd",
"family_key": "publisher-oa"
},
"acs-authorchoice-nih": {
"version_key": "acs-authorchoice-nih",
"name_key": "acs-authorchoice-nih",
"family_key": "publisher-oa"
},
"agpl-3": {
"version_key": "agpl-3.0",
"name_key": "agpl-3",
"family_key": "copyleft",
"aliases": [
"agpl-v3",
"agpl 3",
"agpl",
"agpl v3",
"agpl-3.0+"
]
},
"aip-rights": {
"version_key": "aip-rights",
"name_key": "aip-rights",
"family_key": "publisher-proprietary",
"aliases": [
"aip permissions"
]
},
"all rights reserved": {
"version_key": "all-rights-reserved",
"name_key": "all-rights-reserved",
"family_key": "publisher-proprietary",
"aliases": [
"all-rights-reserved"
]
},
"apache 2.0": {
"version_key": "apache-2.0",
"name_key": "apache",
"family_key": "osi",
"aliases": [
"apache 2",
"apache",
"apache license",
"apache license 2.0"
]
},
"aps-default": {
"version_key": "aps-default",
"name_key": "aps-default",
"family_key": "publisher-proprietary",
"aliases": [
"aps default license"
]
},
"aps-tdm": {
"version_key": "aps-tdm",
"name_key": "aps-tdm",
"family_key": "publisher-tdm",
"aliases": [
"aps text mining"
]
},
"author manuscript": {
"version_key": "author-manuscript",
"name_key": "author-manuscript",
"family_key": "publisher-oa",
"aliases": [
"author-manuscript"
]
},
"bmj-copyright": {
"version_key": "bmj-copyright",
"name_key": "bmj-copyright",
"family_key": "publisher-proprietary"
},
"bsd 2-clause": {
"version_key": "bsd-2-clause",
"name_key": "bsd-2-clause",
"family_key": "osi",
"aliases": [
"bsd 2 clause",
"bsd-2-clause",
"bsd-2"
]
},
"bsd 3-clause": {
"version_key": "bsd-3-clause",
"name_key": "bsd-3-clause",
"family_key": "osi",
"aliases": [
"bsd 3 clause",
"bsd-3-clause",
"bsd-3",
"bsd-3 license",
"bsd",
"bsd license"
],
"justification": "BSD 3-Clause is sometimes called 'BSD', so we need to make sure that this doesn't get confused with the generic 'bsd' alias for the BSD-2-Clause license."
},
"cc by": {
"version_key": "cc-by",
"name_key": "cc-by",
"family_key": "cc",
"aliases": [
"cc-by",
"cc by",
"creative commons attribution",
"creative commons attribution license",
"creative commons by"
]
},
"cc by 1.0": {
"version_key": "cc-by-1.0",
"name_key": "cc-by",
"family_key": "cc"
},
"cc by 2.0": {
"version_key": "cc-by-2.0",
"name_key": "cc-by",
"family_key": "cc"
},
"cc by 2.5": {
"version_key": "cc-by-2.5",
"name_key": "cc-by",
"family_key": "cc"
},
"cc by 3.0": {
"version_key": "cc-by-3.0",
"name_key": "cc-by",
"family_key": "cc",
"aliases": [
"cc-by-3.0",
"cc-by-3",
"creative commons attribution 3.0"
]
},
"cc by 4.0": {
"version_key": "cc-by-4.0",
"name_key": "cc-by",
"family_key": "cc",
"aliases": [
"cc-by-4.0",
"cc by 4",
"cc-by 4",
"cc-by-4",
"creative commons attribution 4.0",
"creative commons attribution 4.0 international",
"creative commons attribution 4.0 international license",
"creative commons by 4.0"
]
},
"cc by-nc": {
"version_key": "cc-by-nc",
"name_key": "cc-by-nc",
"family_key": "cc",
"aliases": [
"cc-by-nc",
"cc by nc",
"cc-by nc",
"creative commons attribution-noncommercial",
"creative commons by-nc"
]
},
"cc by-nc 3.0": {
"version_key": "cc-by-nc-3.0",
"name_key": "cc-by-nc",
"family_key": "cc"
},
"cc by-nc 4.0": {
"version_key": "cc-by-nc-4.0",
"name_key": "cc-by-nc",
"family_key": "cc",
"aliases": [
"cc-by-nc-4.0",
"cc by nc 4",
"cc-by nc 4",
"cc by nc-4",
"cc-by nc-4",
"cc-by-nc 4",
"creative commons attribution-noncommercial 4.0",
"creative commons attribution-noncommercial 4.0 international",
"creative commons attribution-noncommercial 4.0 international license",
"creative commons by-nc 4.0"
]
},
"cc by-nc-nd": {
"version_key": "cc-by-nc-nd",
"name_key": "cc-by-nc-nd",
"family_key": "cc",
"aliases": [
"cc-by-nc-nd",
"cc by nc-nd",
"cc by nc nd",
"cc-by nc-nd",
"creative commons attribution-noncommercial-noderivatives",
"creative commons by-nc-nd"
]
},
"cc by-nc-nd 3.0": {
"version_key": "cc-by-nc-nd-3.0",
"name_key": "cc-by-nc-nd",
"family_key": "cc"
},
"cc by-nc-nd 3.0 igo": {
"version_key": "cc-by-nc-nd-3.0-igo",
"name_key": "cc-by-nc-nd",
"family_key": "cc",
"justification": "IGO is a jurisdiction tag not a rights modifier. Rights profile (Attribution + NonCommercial + NoDerivatives) is identical to base instrument. Enforcement differs (international arbitration vs domestic courts) but does not affect license type."
},
"cc by-nc-nd 4.0": {
"version_key": "cc-by-nc-nd-4.0",
"name_key": "cc-by-nc-nd",
"family_key": "cc",
"aliases": [
"cc-by-nc-nd-4.0",
"cc by nc-nd 4",
"cc-by nc-nd 4",
"cc by nc-nd-4",
"cc-by nc-nd-4",
"cc-by-nc-nd 4",
"creative commons attribution-noncommercial-noderivatives 4.0",
"creative commons attribution-noncommercial-noderivatives 4.0 international",
"creative commons attribution-noncommercial-noderivatives 4.0 international license",
"creative commons by-nc-nd 4.0"
]
},
"cc by-nc-sa": {
"version_key": "cc-by-nc-sa",
"name_key": "cc-by-nc-sa",
"family_key": "cc",
"aliases": [
"cc-by-nc-sa",
"cc by nc-sa",
"cc by nc sa",
"cc-by nc-sa",
"creative commons by-nc-sa"
]
},
"cc by-nc-sa 3.0": {
"version_key": "cc-by-nc-sa-3.0",
"name_key": "cc-by-nc-sa",
"family_key": "cc"
},
"cc by-nc-sa 4.0": {
"version_key": "cc-by-nc-sa-4.0",
"name_key": "cc-by-nc-sa",
"family_key": "cc",
"aliases": [
"cc-by-nc-sa-4.0",
"cc by nc-sa 4",
"cc-by nc-sa 4",
"cc-by-nc-sa 4",
"cc by nc-sa-4",
"cc-by nc-sa-4",
"creative commons attribution-noncommercial-sharealike 4.0",
"creative commons attribution-noncommercial-sharealike 4.0 international",
"creative commons attribution-noncommercial-sharealike 4.0 international license",
"creative commons by-nc-sa 4.0"
]
},
"cc by-nd": {
"version_key": "cc-by-nd",
"name_key": "cc-by-nd",
"family_key": "cc",
"aliases": [
"cc-by-nd",
"cc by nd",
"cc-by nd",
"creative commons by-nd",
"creative commons attribution-noderivatives"
]
},
"cc by-nd 3.0": {
"version_key": "cc-by-nd-3.0",
"name_key": "cc-by-nd",
"family_key": "cc"
},
"cc by-nd 4.0": {
"version_key": "cc-by-nd-4.0",
"name_key": "cc-by-nd",
"family_key": "cc",
"aliases": [
"cc-by-nd-4.0",
"cc by nd 4",
"cc-by nd 4",
"cc by nd-4",
"cc-by nd-4",
"cc-by-nd 4",
"creative commons attribution-noderivatives 4.0",
"creative commons attribution-noderivatives 4.0 international",
"creative commons attribution-noderivatives 4.0 international license",
"creative commons by-nd 4.0"
]
},
"cc by-sa": {
"version_key": "cc-by-sa",
"name_key": "cc-by-sa",
"family_key": "cc",
"aliases": [
"cc-by-sa",
"cc by sa",
"cc-by sa",
"creative commons attribution-sharealike",
"creative commons by-sa"
]
},
"cc by-sa 3.0": {
"version_key": "cc-by-sa-3.0",
"name_key": "cc-by-sa",
"family_key": "cc"
},
"cc by-sa 4.0": {
"version_key": "cc-by-sa-4.0",
"name_key": "cc-by-sa",
"family_key": "cc",
"aliases": [
"cc-by-sa-4.0",
"cc by sa 4",
"cc-by sa 4",
"cc by sa-4",
"cc-by sa-4",
"cc-by-sa 4",
"creative commons attribution-sharealike 4.0",
"creative commons attribution-sharealike 4.0 international",
"creative commons attribution-sharealike 4.0 international license",
"creative commons by-sa 4.0"
]
},
"cc-pdm 1.0": {
"version_key": "cc-pdm-1.0",
"name_key": "cc-pdm",
"family_key": "public-domain",
"aliases": [
"cc-pdm-1.0",
"cc pdm 1.0",
"cc pdm-1.0",
"cc-pdm",
"cc pdm",
"creative commons public domain",
"creative commons public domain mark 1.0",
"creative commons public domain mark"
]
},
"cc0 1.0": {
"version_key": "cc0-1.0",
"name_key": "cc0",
"family_key": "cc0",
"aliases": [
"cc0-1.0",
"cc-zero 1.0",
"cc zero 1.0",
"creative commons zero 1.0",
"cc0",
"cc 0",
"cc zero",
"creative commons zero",
"cc-zero"
]
},
"cup-terms": {
"version_key": "cup-terms",
"name_key": "cup-terms",
"family_key": "publisher-proprietary",
"aliases": [
"cambridge terms"
]
},
"degruyter-terms": {
"version_key": "degruyter-terms",
"name_key": "degruyter-terms",
"family_key": "publisher-proprietary",
"aliases": [
"de gruyter terms"
]
},
"elsevier oa": {
"version_key": "elsevier-oa",
"name_key": "elsevier-oa",
"family_key": "publisher-oa",
"aliases": [
"elsevier-oa",
"elsevier user license"
]
},
"elsevier tdm": {
"version_key": "elsevier-tdm",
"name_key": "elsevier-tdm",
"family_key": "publisher-tdm",
"aliases": [
"elsevier tdmu",
"elsevier-tdm"
]
},
"gpl-2": {
"version_key": "gpl-2.0",
"name_key": "gpl-2",
"family_key": "copyleft",
"aliases": [
"gpl-v2",
"gpl 2",
"gnu gpl v2",
"gpl v2",
"gpl-2.0+"
]
},
"gpl-3": {
"version_key": "gpl-3.0",
"name_key": "gpl-3",
"family_key": "copyleft",
"aliases": [
"gpl-v3",
"gpl v3 only",
"gpl 3",
"gnu gpl",
"gnu gpl v3",
"gpl",
"gpl v3",
"gpl-3.0+"
],
"justification": "gnu gpl, gnu gpl v3, gpl, gpl v3, gpl-3, and gpl-3.0+ are all standard aliases for GPL-3.0."
},
"implied oa": {
"version_key": "implied-oa",
"name_key": "implied-oa",
"family_key": "publisher-oa",
"aliases": [
"implied open access",
"implied-oa"
]
},
"iop-copyright": {
"version_key": "iop-copyright",
"name_key": "iop-copyright",
"family_key": "publisher-proprietary"
},
"iop-tdm": {
"version_key": "iop-tdm",
"name_key": "iop-tdm",
"family_key": "publisher-tdm",
"aliases": [
"iop text and data mining"
]
},
"isc license": {
"version_key": "isc",
"name_key": "isc",
"family_key": "osi"
},
"jama-cc-by": {
"version_key": "jama-cc-by",
"name_key": "jama-cc-by",
"family_key": "publisher-oa",
"aliases": [
"jama open access"
]
},
"lgpl": {
"version_key": "lgpl-3.0",
"name_key": "lgpl-3",
"family_key": "copyleft"
},
"lgpl v2.1": {
"version_key": "lgpl-2.1",
"name_key": "lgpl-2.1",
"family_key": "copyleft"
},
"lgpl v3": {
"version_key": "lgpl-3.0",
"name_key": "lgpl-3",
"family_key": "copyleft"
},
"lgpl-2": {
"version_key": "lgpl-2.1",
"name_key": "lgpl-2.1",
"family_key": "copyleft",
"aliases": [
"lgpl-v2",
"lgpl 2",
"lgpl-2.1-only",
"lgpl-2.1-or-later"
]
},
"lgpl-2.1+": {
"version_key": "lgpl-2.1",
"name_key": "lgpl-2.1",
"family_key": "copyleft"
},
"lgpl-3": {
"version_key": "lgpl-3.0",
"name_key": "lgpl-3",
"family_key": "copyleft",
"aliases": [
"lgpl-v3",
"lgpl 3"
]
},
"lgpl-3.0+": {
"version_key": "lgpl-3.0",
"name_key": "lgpl-3",
"family_key": "copyleft"
},
"mit license": {
"version_key": "mit",
"name_key": "mit",
"family_key": "osi",
"aliases": [
"the mit license"
]
},
"mozilla public license 2.0": {
"version_key": "mpl-2.0",
"name_key": "mpl",
"family_key": "osi",
"aliases": [
"mpl",
"mpl-2.0",
"mpl 2.0",
"mozilla license",
"mozilla public license",
"mozilla"
]
},
"no reuse": {
"version_key": "no-reuse",
"name_key": "no-reuse",
"family_key": "publisher-proprietary",
"aliases": [
"no-reuse"
]
},
"odbl": {
"version_key": "odbl",
"name_key": "odbl",
"family_key": "open-data",
"aliases": [
"open database license"
]
},
"odc-by": {
"version_key": "odc-by",
"name_key": "odc-by",
"family_key": "open-data"
},
"other-oa": {
"version_key": "other-oa",
"name_key": "other-oa",
"family_key": "other-oa",
"aliases": [
"open access",
"open-access"
]
},
"oup-chorus": {
"version_key": "oup-chorus",
"name_key": "oup-chorus",
"family_key": "publisher-oa"
},
"oup-terms": {
"version_key": "oup-terms",
"name_key": "oup-terms",
"family_key": "publisher-proprietary",
"aliases": [
"oup standard publication"
]
},
"pd": {
"version_key": "public-domain",
"name_key": "public-domain",
"family_key": "public-domain",
"aliases": [
"public domain",
"public-domain"
]
},
"pddl": {
"version_key": "pddl",
"name_key": "pddl",
"family_key": "open-data"
},
"pnas terms": {
"version_key": "pnas-licenses",
"name_key": "pnas-licenses",
"family_key": "publisher-proprietary",
"aliases": [
"pnas-licenses"
]
},
"rsc-terms": {
"version_key": "rsc-terms",
"name_key": "rsc-terms",
"family_key": "publisher-proprietary"
},
"sage-permissions": {
"version_key": "sage-permissions",
"name_key": "sage-permissions",
"family_key": "publisher-proprietary"
},
"springer tdm": {
"version_key": "springer-tdm",
"name_key": "springer-tdm",
"family_key": "publisher-tdm",
"aliases": [
"springer-tdm"
]
},
"springernature-tdm": {
"version_key": "springernature-tdm",
"name_key": "springernature-tdm",
"family_key": "publisher-tdm",
"aliases": [
"springer nature tdm",
"springer nature text and data mining"
]
},
"tandf-terms": {
"version_key": "tandf-terms",
"name_key": "tandf-terms",
"family_key": "publisher-proprietary",
"aliases": [
"taylor and francis terms",
"taylor francis terms"
]
},
"thieme nlm": {
"version_key": "thieme-nlm",
"name_key": "thieme-nlm",
"family_key": "publisher-oa",
"aliases": [
"thieme-nlm"
]
},
"unlicense": {
"version_key": "unlicense",
"name_key": "unlicense",
"family_key": "osi"
},
"unspecified oa": {
"version_key": "unspecified-oa",
"name_key": "unspecified-oa",
"family_key": "other-oa",
"aliases": [
"unspecified-oa"
]
},
"wiley terms": {
"version_key": "wiley-terms",
"name_key": "wiley-terms",
"family_key": "publisher-proprietary",
"aliases": [
"wiley-terms"
]
},
"wiley-am": {
"version_key": "wiley-am",
"name_key": "wiley-am",
"family_key": "publisher-proprietary",
"aliases": [
"wiley author manuscript"
]
},
"wiley-tdm": {
"version_key": "wiley-tdm",
"name_key": "wiley-tdm",
"family_key": "publisher-tdm",
"aliases": [
"wiley tdm license"
]
},
"wiley-vor": {
"version_key": "wiley-vor",
"name_key": "wiley-vor",
"family_key": "publisher-proprietary"
},
"wtfpl": {
"version_key": "wtfpl",
"name_key": "wtfpl",
"family_key": "osi"
},
"zlib": {
"version_key": "zlib",
"name_key": "zlib",
"family_key": "osi"
},
"© the author(s)": {
"version_key": "publisher-specific-oa",
"name_key": "publisher-specific-oa",
"family_key": "publisher-oa",
"aliases": [
"publisher specific oa",
"publisher-specific-oa"
]
}
}
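The alias file above maps a primary label to its three keys plus an optional ``aliases`` list. A minimal sketch of how such a structure could be flattened into a single lookup table follows. The names ``Keys`` and ``build_alias_index`` are hypothetical and not taken from the package, and the "first entry wins" rule for duplicate labels is an assumption rather than documented behaviour.

```python
from typing import Dict, NamedTuple


class Keys(NamedTuple):
    """The three hierarchy levels stored for every entry."""
    version_key: str
    name_key: str
    family_key: str


def build_alias_index(entries: Dict[str, dict]) -> Dict[str, Keys]:
    """Flatten primary labels and their aliases into one lookup table."""
    index: Dict[str, Keys] = {}
    for primary, meta in entries.items():
        keys = Keys(meta["version_key"], meta["name_key"], meta["family_key"])
        # "aliases" is optional: entries such as "isc license" have none.
        for label in (primary, *meta.get("aliases", [])):
            # Assumption: the first entry to claim a label keeps it.
            index.setdefault(label, keys)
    return index


# Two entries copied from the data above, trimmed for brevity.
SAMPLE = {
    "cc0 1.0": {
        "version_key": "cc0-1.0",
        "name_key": "cc0",
        "family_key": "cc0",
        "aliases": ["cc0", "cc zero"],
    },
    "isc license": {
        "version_key": "isc",
        "name_key": "isc",
        "family_key": "osi",
    },
}

index = build_alias_index(SAMPLE)
print(index["cc zero"].version_key)
print(index["isc license"].family_key)
```

Keeping all three keys in one tuple means a single dictionary hit resolves the version, name and family levels together.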
src/licence_normaliser/data/prose/prose_patterns.json
[
{"pattern": "cc\\s*by-nc-nd\\s*4\\.0", "version_key": "cc-by-nc-nd-4.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
{"pattern": "cc\\s*by-nc-nd\\s*3\\.0", "version_key": "cc-by-nc-nd-3.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
{"pattern": "cc\\s*by-nc-nd", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"},
{"pattern": "cc\\s*by-nc-sa\\s*4\\.0", "version_key": "cc-by-nc-sa-4.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
{"pattern": "cc\\s*by-nc-sa\\s*3\\.0", "version_key": "cc-by-nc-sa-3.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
{"pattern": "creative\\s+commons\\s+by-nc-sa", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"},
{"pattern": "creative\\s+commons\\s+by-nc-nd", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"},
{"pattern": "creative\\s+commons\\s+by-nc", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"},
{"pattern": "creative\\s+commons\\s+by-nd", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"},
{"pattern": "creative\\s+commons\\s+by-sa", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"},
{"pattern": "creative\\s+commons\\s+by", "version_key": "cc-by", "name_key": "cc-by", "family_key": "cc"},
{"pattern": "cc\\s*by-nc-sa", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"},
{"pattern": "cc\\s*by-nc\\s*4\\.0", "version_key": "cc-by-nc-4.0", "name_key": "cc-by-nc", "family_key": "cc"},
{"pattern": "cc\\s*by-nc\\s*3\\.0", "version_key": "cc-by-nc-3.0", "name_key": "cc-by-nc", "family_key": "cc"},
{"pattern": "cc\\s*by-nc", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"},
{"pattern": "cc\\s*by-sa\\s*4\\.0", "version_key": "cc-by-sa-4.0", "name_key": "cc-by-sa", "family_key": "cc"},
{"pattern": "cc\\s*by-sa\\s*3\\.0", "version_key": "cc-by-sa-3.0", "name_key": "cc-by-sa", "family_key": "cc"},
{"pattern": "cc\\s*by-sa", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"},
{"pattern": "cc\\s*by-nd\\s*4\\.0", "version_key": "cc-by-nd-4.0", "name_key": "cc-by-nd", "family_key": "cc"},
{"pattern": "cc\\s*by-nd\\s*3\\.0", "version_key": "cc-by-nd-3.0", "name_key": "cc-by-nd", "family_key": "cc"},
{"pattern": "cc\\s*by-nd", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"},
{"pattern": "cc\\s*by\\s*4\\.0", "version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"},
{"pattern": "cc\\s*by\\s*3\\.0", "version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"},
{"pattern": "cc\\s*by\\s*2\\.0", "version_key": "cc-by-2.0", "name_key": "cc-by", "family_key": "cc"},
{"pattern": "\\bcc\\s*by\\b(?!\\s*-)", "version_key": "cc-by", "name_key": "cc-by", "family_key": "cc"},
{"pattern": "\\bcc\\s*0\\b|cc\\s*zero", "version_key": "cc0", "name_key": "cc0", "family_key": "cc0"},
{"pattern": "attribution.{0,30}non.?commercial.{0,30}no.?deriv", "version_key": "cc-by-nc-nd", "name_key": "cc-by-nc-nd", "family_key": "cc"},
{"pattern": "attribution.{0,30}non.?commercial.{0,30}share.?alike", "version_key": "cc-by-nc-sa", "name_key": "cc-by-nc-sa", "family_key": "cc"},
{"pattern": "attribution.{0,30}non.?commercial", "version_key": "cc-by-nc", "name_key": "cc-by-nc", "family_key": "cc"},
{"pattern": "attribution.{0,30}no.?deriv", "version_key": "cc-by-nd", "name_key": "cc-by-nd", "family_key": "cc"},
{"pattern": "attribution.{0,30}share.?alike", "version_key": "cc-by-sa", "name_key": "cc-by-sa", "family_key": "cc"},
{"pattern": "elsevier.*tdm|tdm.*elsevier", "version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"},
{"pattern": "elsevier.*user\\s*licen", "version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"},
{"pattern": "wiley.*tdm|tdm.*wiley", "version_key": "wiley-tdm", "name_key": "wiley-tdm", "family_key": "publisher-tdm"},
{"pattern": "springer.*tdm|tdm.*springer", "version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"},
{"pattern": "acs\\s*authorchoice.*cc\\s*by(?!-nc)", "version_key": "acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa"},
{"pattern": "acs\\s*authorchoice", "version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"},
{"pattern": "all\\s*rights\\s*reserved", "version_key": "all-rights-reserved", "name_key": "all-rights-reserved", "family_key": "publisher-proprietary"},
{"pattern": "author\\s*manuscript", "version_key": "author-manuscript", "name_key": "author-manuscript", "family_key": "publisher-oa"},
{"pattern": "public\\s*domain", "version_key": "public-domain", "name_key": "public-domain", "family_key": "public-domain"},
{"pattern": "open\\s*access", "version_key": "other-oa", "name_key": "other-oa", "family_key": "other-oa"}
]
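The order of the patterns above matters if they are tried top to bottom: the more specific ``cc by-nc-nd 4.0`` pattern sits before the shorter ``cc by-nc`` one, which would otherwise match the same text. The sketch below illustrates that idea with four patterns copied from the list. It assumes a case-insensitive ``re.search`` where the first match wins; the function name ``match_prose`` is hypothetical and the real parser in ``parsers/prose.py`` may work differently.

```python
import re
from typing import List, Optional, Tuple

# (pattern, version_key) pairs copied from prose_patterns.json, in file order.
PATTERNS: List[Tuple[str, str]] = [
    (r"cc\s*by-nc-nd\s*4\.0", "cc-by-nc-nd-4.0"),
    (r"cc\s*by-nc-nd", "cc-by-nc-nd"),
    (r"cc\s*by-nc", "cc-by-nc"),
    (r"\bcc\s*by\b(?!\s*-)", "cc-by"),
]

# Assumption: matching is case-insensitive.
COMPILED = [(re.compile(p, re.IGNORECASE), key) for p, key in PATTERNS]


def match_prose(text: str) -> Optional[str]:
    """Return the version_key of the first pattern found in the text."""
    for regex, version_key in COMPILED:
        if regex.search(text):
            return version_key
    return None


print(match_prose("Licensed under CC BY-NC-ND 4.0"))
print(match_prose("released as CC BY"))
print(match_prose("cc by-sa"))
```

The negative lookahead in the last pattern keeps a bare ``cc by`` from swallowing hyphenated variants such as ``cc by-sa``, which this reduced list therefore leaves unmatched.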
src/licence_normaliser/data/publishers/publishers.json
{
"_comment": "Publisher-specific license URLs and shorthand aliases.",
"_comment2": "URLs: normalised to https with no trailing slash on lookup.",
"_comment3": "Aliases: cleaned-lowercase form -> version_key.",
"urls": {
"https://www.elsevier.com/open-access/userlicense/1.0/": {
"version_key": "elsevier-oa",
"name_key": "elsevier-oa",
"family_key": "publisher-oa"
},
"http://www.elsevier.com/open-access/userlicense/1.0/": {
"version_key": "elsevier-oa",
"name_key": "elsevier-oa",
"family_key": "publisher-oa"
},
"https://www.elsevier.com/tdm/userlicense/1.0/": {
"version_key": "elsevier-tdm",
"name_key": "elsevier-tdm",
"family_key": "publisher-tdm"
},
"http://www.elsevier.com/tdm/userlicense/1.0/": {
"version_key": "elsevier-tdm",
"name_key": "elsevier-tdm",
"family_key": "publisher-tdm"
},
"http://doi.wiley.com/10.1002/tdm_license_1": {
"version_key": "wiley-tdm",
"name_key": "wiley-tdm",
"family_key": "publisher-tdm"
},
"http://doi.wiley.com/10.1002/tdm_license_1.1": {
"version_key": "wiley-tdm-1.1",
"name_key": "wiley-tdm",
"family_key": "publisher-tdm"
},
"http://onlinelibrary.wiley.com/termsAndConditions#vor": {
"version_key": "wiley-vor",
"name_key": "wiley-vor",
"family_key": "publisher-proprietary"
},
"http://onlinelibrary.wiley.com/termsAndConditions#am": {
"version_key": "wiley-am",
"name_key": "wiley-am",
"family_key": "publisher-proprietary"
},
"https://onlinelibrary.wiley.com/termsandconditions#vor": {
"version_key": "wiley-vor",
"name_key": "wiley-vor",
"family_key": "publisher-proprietary"
},
"https://onlinelibrary.wiley.com/termsandconditions#am": {
"version_key": "wiley-am",
"name_key": "wiley-am",
"family_key": "publisher-proprietary"
},
"https://onlinelibrary.wiley.com/termsandconditions": {
"version_key": "wiley-terms",
"name_key": "wiley-terms",
"family_key": "publisher-proprietary"
},
"https://onlinelibrary.wiley.com/terms-and-conditions": {
"version_key": "wiley-terms",
"name_key": "wiley-terms",
"family_key": "publisher-proprietary"
},
"https://www.springer.com/tdm": {
"version_key": "springer-tdm",
"name_key": "springer-tdm",
"family_key": "publisher-tdm"
},
"http://www.springer.com/tdm": {
"version_key": "springer-tdm",
"name_key": "springer-tdm",
"family_key": "publisher-tdm"
},
"https://www.springernature.com/gp/researchers/text-and-data-mining": {
"version_key": "springernature-tdm",
"name_key": "springernature-tdm",
"family_key": "publisher-tdm"
},
"https://www.tandfonline.com/action/showCopyRight": {
"version_key": "tandf-terms",
"name_key": "tandf-terms",
"family_key": "publisher-proprietary"
},
"https://www.tandfonline.com/action/showcopyright": {
"version_key": "tandf-terms",
"name_key": "tandf-terms",
"family_key": "publisher-proprietary"
},
"https://tandfonline.com/action/showcopyright": {
"version_key": "tandf-terms",
"name_key": "tandf-terms",
"family_key": "publisher-proprietary"
},
"https://www.tandfonline.com/action/showcopyright?show=full": {
"version_key": "tandf-terms",
"name_key": "tandf-terms",
"family_key": "publisher-proprietary"
},
"https://us.sagepub.com/en-us/nam/journals-permissions": {
"version_key": "sage-permissions",
"name_key": "sage-permissions",
"family_key": "publisher-proprietary"
},
"https://www.sagepub.com/journalspermissions.nav": {
"version_key": "sage-permissions",
"name_key": "sage-permissions",
"family_key": "publisher-proprietary"
},
"https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": {
"version_key": "acs-authorchoice-ccby",
"name_key": "acs-authorchoice-ccby",
"family_key": "publisher-oa"
},
"http://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": {
"version_key": "acs-authorchoice-ccby",
"name_key": "acs-authorchoice-ccby",
"family_key": "publisher-oa"
},
"https://pubs.acs.org/page/policy/authorchoice_ccbyncnd_termsofuse.html": {
"version_key": "acs-authorchoice-ccbyncnd",
"name_key": "acs-authorchoice-ccbyncnd",
"family_key": "publisher-oa"
},
"https://pubs.acs.org/page/policy/authorchoice_termsofuse.html": {
"version_key": "acs-authorchoice",
"name_key": "acs-authorchoice",
"family_key": "publisher-oa"
},
"https://pubs.acs.org/page/policy/acs_authorchoice_with_nih_addendum_termsofuse.html": {
"version_key": "acs-authorchoice-nih",
"name_key": "acs-authorchoice-nih",
"family_key": "publisher-oa"
},
"https://doi.org/10.1021/policy/oa-license": {
"version_key": "acs-authorchoice",
"name_key": "acs-authorchoice",
"family_key": "publisher-oa"
},
"https://www.rsc.org/journals-books-databases/journal-authors-reviewers/licences-copyright-permissions/": {
"version_key": "rsc-terms",
"name_key": "rsc-terms",
"family_key": "publisher-proprietary"
},
"https://www.rsc.org/help/disclaimer/pages/term3.aspx": {
"version_key": "rsc-terms",
"name_key": "rsc-terms",
"family_key": "publisher-proprietary"
},
"https://iopscience.iop.org/info/page/text-and-data-mining": {
"version_key": "iop-tdm",
"name_key": "iop-tdm",
"family_key": "publisher-tdm"
},
"http://iopscience.iop.org/info/page/text-and-data-mining": {
"version_key": "iop-tdm",
"name_key": "iop-tdm",
"family_key": "publisher-tdm"
},
"https://iopscience.iop.org/page/copyright": {
"version_key": "iop-copyright",
"name_key": "iop-copyright",
"family_key": "publisher-proprietary"
},
"https://www.bmj.com/company/legal-stuff/copyright-notice/": {
"version_key": "bmj-copyright",
"name_key": "bmj-copyright",
"family_key": "publisher-proprietary"
},
"https://group.bmj.com/group/rights-licensing/permissions": {
"version_key": "bmj-copyright",
"name_key": "bmj-copyright",
"family_key": "publisher-proprietary"
},
"https://www.science.org/content/page/science-licenses-journal-article-reuse": {
"version_key": "aaas-author-reuse",
"name_key": "aaas-author-reuse",
"family_key": "publisher-proprietary"
},
"https://www.sciencemag.org/about/science-licenses-journal-article-reuse": {
"version_key": "aaas-author-reuse",
"name_key": "aaas-author-reuse",
"family_key": "publisher-proprietary"
},
"https://www.pnas.org/site/aboutpnas/licenses.xhtml": {
"version_key": "pnas-licenses",
"name_key": "pnas-licenses",
"family_key": "publisher-proprietary"
},
"https://link.aps.org/licenses/aps-default-license": {
"version_key": "aps-default",
"name_key": "aps-default",
"family_key": "publisher-proprietary"
},
"https://link.aps.org/licenses/aps-default-text-mining-license": {
"version_key": "aps-tdm",
"name_key": "aps-tdm",
"family_key": "publisher-tdm"
},
"https://www.cambridge.org/core/terms": {
"version_key": "cup-terms",
"name_key": "cup-terms",
"family_key": "publisher-proprietary"
},
"https://publishing.aip.org/authors/rights-and-permissions": {
"version_key": "aip-rights",
"name_key": "aip-rights",
"family_key": "publisher-proprietary"
},
"http://publishing.aip.org/authors/rights-and-permissions": {
"version_key": "aip-rights",
"name_key": "aip-rights",
"family_key": "publisher-proprietary"
},
"https://jamanetwork.com/pages/cc-by-license-permissions": {
"version_key": "jama-cc-by",
"name_key": "jama-cc-by",
"family_key": "publisher-oa"
},
"https://www.degruyter.com/dg/page/496": {
"version_key": "degruyter-terms",
"name_key": "degruyter-terms",
"family_key": "publisher-proprietary"
},
"https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model": {
"version_key": "oup-chorus",
"name_key": "oup-chorus",
"family_key": "publisher-oa"
},
"https://academic.oup.com/pages/standard-publication-reuse-rights": {
"version_key": "oup-terms",
"name_key": "oup-terms",
"family_key": "publisher-proprietary"
},
"https://www.gnu.org/licenses/gpl-2.0.html": {
"version_key": "gpl-2.0",
"name_key": "gpl-2",
"family_key": "copyleft"
},
"https://www.gnu.org/licenses/gpl-3.0.html": {
"version_key": "gpl-3.0",
"name_key": "gpl-3",
"family_key": "copyleft"
},
"https://www.gnu.org/licenses/agpl-3.0.html": {
"version_key": "agpl-3.0",
"name_key": "agpl-3",
"family_key": "copyleft"
},
"https://www.gnu.org/licenses/lgpl-2.1.html": {
"version_key": "lgpl-2.1",
"name_key": "lgpl-2.1",
"family_key": "copyleft"
},
"https://www.gnu.org/licenses/lgpl-3.0.html": {
"version_key": "lgpl-3.0",
"name_key": "lgpl-3",
"family_key": "copyleft"
},
"https://opendatacommons.org/licenses/odbl/1-0/": {
"version_key": "odbl",
"name_key": "odbl",
"family_key": "open-data"
},
"https://opendatacommons.org/licenses/by/1-0/": {
"version_key": "odc-by",
"name_key": "odc-by",
"family_key": "open-data"
},
"https://opendatacommons.org/licenses/pddl/1-0/": {
"version_key": "pddl",
"name_key": "pddl",
"family_key": "open-data"
}
},
"shorthand_aliases": {
"elsevier user license": "elsevier-oa",
"elsevier tdm": "elsevier-tdm",
"elsevier tdmu": "elsevier-tdm",
"wiley tdm license": "wiley-tdm",
"wiley tdm": "wiley-tdm",
"wiley vor": "wiley-vor",
"wiley am": "wiley-am",
"wiley author manuscript": "wiley-am",
"springer tdm": "springer-tdm",
"springer nature tdm": "springernature-tdm",
"springer nature text and data mining": "springernature-tdm",
"tandf terms": "tandf-terms",
"taylor and francis terms": "tandf-terms",
"taylor francis terms": "tandf-terms",
"sage permissions": "sage-permissions",
"acs authorchoice": "acs-authorchoice",
"acs author choice": "acs-authorchoice",
"acs authorchoice cc by": "acs-authorchoice-ccby",
"acs authorchoice cc by nc nd": "acs-authorchoice-ccbyncnd",
"acs authorchoice nih": "acs-authorchoice-nih",
"rsc terms": "rsc-terms",
"rsc copyright": "rsc-terms",
"iop tdm": "iop-tdm",
"iop text and data mining": "iop-tdm",
"iop copyright": "iop-copyright",
"bmj copyright": "bmj-copyright",
"bmj permissions": "bmj-copyright",
"aaas author reuse": "aaas-author-reuse",
"aaas reuse": "aaas-author-reuse",
"science author reuse": "aaas-author-reuse",
"pnas licenses": "pnas-licenses",
"pnas terms": "pnas-licenses",
"aps default": "aps-default",
"aps tdm": "aps-tdm",
"aps text mining": "aps-tdm",
"aps default license": "aps-default",
"cambridge terms": "cup-terms",
"cup terms": "cup-terms",
"aip rights": "aip-rights",
"aip permissions": "aip-rights",
"jama cc by": "jama-cc-by",
"jama open access": "jama-cc-by",
"degruyter terms": "degruyter-terms",
"de gruyter terms": "degruyter-terms",
"oup chorus": "oup-chorus",
"oup terms": "oup-terms",
"oup standard publication": "oup-terms",
"thieme nlm": "thieme-nlm",
"implied oa": "implied-oa",
"implied open access": "implied-oa",
"unspecified oa": "unspecified-oa",
"publisher specific oa": "publisher-specific-oa",
"author manuscript": "author-manuscript",
"all rights reserved": "all-rights-reserved",
"no reuse": "no-reuse",
"public domain": "public-domain",
"open access": "other-oa",
"creative commons public domain": "cc-pdm-1.0",
"pd": "public-domain"
}
}
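The ``_comment2`` field in the file above says URLs are normalised to https with no trailing slash. A minimal sketch of that rule, applied both when building the table and when looking up, could look like the following. ``normalise_url`` and ``lookup`` are hypothetical names, and the sketch deliberately does not lowercase the URL because the source does not say whether the package does.

```python
from typing import Dict, Optional


def normalise_url(url: str) -> str:
    """Force the https scheme and drop any trailing slash."""
    url = url.strip()
    if url.startswith("http://"):
        url = "https://" + url[len("http://"):]
    return url.rstrip("/")


# Two URL -> version_key pairs taken from the "urls" section above.
RAW: Dict[str, str] = {
    "https://www.elsevier.com/tdm/userlicense/1.0/": "elsevier-tdm",
    "http://www.springer.com/tdm": "springer-tdm",
}

# Normalise the keys once so lookups compare like with like.
URL_INDEX: Dict[str, str] = {normalise_url(u): key for u, key in RAW.items()}


def lookup(url: str) -> Optional[str]:
    """Return the version_key for a URL, or None if it is unknown."""
    return URL_INDEX.get(normalise_url(url))


print(lookup("http://www.elsevier.com/tdm/userlicense/1.0"))
print(lookup("https://www.springer.com/tdm/"))
```

With both sides normalised the same way, the separate ``http`` and ``https`` entries in the data file collapse onto one key.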
src/licence_normaliser/data/urls/url_map.json
{
"_comment": "URL -> metadata dict. Both http and https variants may be listed.",
"_comment2": "Normalisation (https, no trailing slash) is applied on load.",
"https://creativecommons.org/licenses/by/4.0/": {"version_key": "cc-by-4.0", "name_key": "cc-by", "family_key": "cc"},
"https://creativecommons.org/licenses/by/3.0/": {"version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"},
"https://creativecommons.org/licenses/by/2.5/": {"version_key": "cc-by-2.5", "name_key": "cc-by", "family_key": "cc"},
"https://creativecommons.org/licenses/by/2.0/": {"version_key": "cc-by-2.0", "name_key": "cc-by", "family_key": "cc"},
"https://creativecommons.org/licenses/by/1.0/": {"version_key": "cc-by-1.0", "name_key": "cc-by", "family_key": "cc"},
"https://creativecommons.org/licenses/by/3.0/deed.en_us": {"version_key": "cc-by-3.0", "name_key": "cc-by", "family_key": "cc"},
"https://creativecommons.org/licenses/by-sa/4.0/": {"version_key": "cc-by-sa-4.0", "name_key": "cc-by-sa", "family_key": "cc"},
"https://creativecommons.org/licenses/by-sa/3.0/": {"version_key": "cc-by-sa-3.0", "name_key": "cc-by-sa", "family_key": "cc"},
"https://creativecommons.org/licenses/by-sa/2.5/": {"version_key": "cc-by-sa-2.5", "name_key": "cc-by-sa", "family_key": "cc"},
"https://creativecommons.org/licenses/by-sa/2.0/": {"version_key": "cc-by-sa-2.0", "name_key": "cc-by-sa", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nd/4.0/": {"version_key": "cc-by-nd-4.0", "name_key": "cc-by-nd", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nd/3.0/": {"version_key": "cc-by-nd-3.0", "name_key": "cc-by-nd", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nd/2.0/": {"version_key": "cc-by-nd-2.0", "name_key": "cc-by-nd", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc/4.0/": {"version_key": "cc-by-nc-4.0", "name_key": "cc-by-nc", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc/3.0/": {"version_key": "cc-by-nc-3.0", "name_key": "cc-by-nc", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc/2.5/": {"version_key": "cc-by-nc-2.5", "name_key": "cc-by-nc", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc/2.0/": {"version_key": "cc-by-nc-2.0", "name_key": "cc-by-nc", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc-sa/4.0/": {"version_key": "cc-by-nc-sa-4.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc-sa/3.0/": {"version_key": "cc-by-nc-sa-3.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc-sa/2.5/": {"version_key": "cc-by-nc-sa-2.5", "name_key": "cc-by-nc-sa", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc-sa/2.0/": {"version_key": "cc-by-nc-sa-2.0", "name_key": "cc-by-nc-sa", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc-nd/4.0/": {"version_key": "cc-by-nc-nd-4.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc-nd/3.0/": {"version_key": "cc-by-nc-nd-3.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc-nd/2.5/": {"version_key": "cc-by-nc-nd-2.5", "name_key": "cc-by-nc-nd", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc-nd/2.0/": {"version_key": "cc-by-nc-nd-2.0", "name_key": "cc-by-nc-nd", "family_key": "cc"},
"https://creativecommons.org/licenses/by/3.0/igo/": {"version_key": "cc-by-3.0-igo", "name_key": "cc-by-igo", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc-sa/3.0/igo/": {"version_key": "cc-by-nc-sa-3.0-igo", "name_key": "cc-by-nc-sa-igo", "family_key": "cc"},
"https://creativecommons.org/licenses/by-nc-nd/3.0/igo/": {"version_key": "cc-by-nc-nd-3.0-igo", "name_key": "cc-by-nc-nd-igo", "family_key": "cc"},
"https://creativecommons.org/publicdomain/zero/1.0/": {"version_key": "cc0", "name_key": "cc0", "family_key": "cc0"},
"https://creativecommons.org/publicdomain/mark/1.0/": {"version_key": "cc-pdm", "name_key": "cc-pdm", "family_key": "public-domain"},
"https://www.gnu.org/licenses/gpl-2.0.html": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"},
"https://www.gnu.org/licenses/gpl-2.0": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"},
"http://www.gnu.org/licenses/gpl-2.0.html": {"version_key": "gpl-2.0", "name_key": "gpl-2", "family_key": "copyleft"},
"https://www.gnu.org/licenses/gpl-3.0.html": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"},
"https://www.gnu.org/licenses/gpl-3.0": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"},
"http://www.gnu.org/licenses/gpl-3.0.html": {"version_key": "gpl-3.0", "name_key": "gpl-3", "family_key": "copyleft"},
"https://www.gnu.org/licenses/agpl-3.0.html": {"version_key": "agpl-3.0", "name_key": "agpl-3", "family_key": "copyleft"},
"https://www.gnu.org/licenses/agpl-3.0": {"version_key": "agpl-3.0", "name_key": "agpl-3", "family_key": "copyleft"},
"https://www.gnu.org/licenses/lgpl-2.1.html": {"version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft"},
"https://www.gnu.org/licenses/lgpl-2.1": {"version_key": "lgpl-2.1", "name_key": "lgpl-2.1", "family_key": "copyleft"},
"https://www.gnu.org/licenses/lgpl-3.0.html": {"version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft"},
"https://www.gnu.org/licenses/lgpl-3.0": {"version_key": "lgpl-3.0", "name_key": "lgpl-3", "family_key": "copyleft"},
"https://opensource.org/licenses/MIT": {"version_key": "mit", "name_key": "mit", "family_key": "osi"},
"https://www.apache.org/licenses/LICENSE-2.0": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"},
"https://www.apache.org/licenses/LICENSE-2.0.html": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"},
"https://opensource.org/licenses/Apache-2.0": {"version_key": "apache-2.0", "name_key": "apache", "family_key": "osi"},
"https://opensource.org/licenses/BSD-2-Clause": {"version_key": "bsd-2-clause", "name_key": "bsd-2-clause", "family_key": "osi"},
"https://opensource.org/licenses/BSD-3-Clause": {"version_key": "bsd-3-clause", "name_key": "bsd-3-clause", "family_key": "osi"},
"https://opensource.org/licenses/ISC": {"version_key": "isc", "name_key": "isc", "family_key": "osi"},
"https://www.mozilla.org/en-US/MPL/2.0/": {"version_key": "mpl-2.0", "name_key": "mpl", "family_key": "osi"},
"https://www.mozilla.org/MPL/2.0/": {"version_key": "mpl-2.0", "name_key": "mpl", "family_key": "osi"},
"https://www.elsevier.com/open-access/userlicense/1.0/": {"version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"},
"http://www.elsevier.com/open-access/userlicense/1.0/": {"version_key": "elsevier-oa", "name_key": "elsevier-oa", "family_key": "publisher-oa"},
"https://www.elsevier.com/tdm/userlicense/1.0/": {"version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"},
"http://www.elsevier.com/tdm/userlicense/1.0/": {"version_key": "elsevier-tdm", "name_key": "elsevier-tdm", "family_key": "publisher-tdm"},
"http://doi.wiley.com/10.1002/tdm_license_1": {"version_key": "wiley-tdm", "name_key": "wiley-tdm", "family_key": "publisher-tdm"},
"http://doi.wiley.com/10.1002/tdm_license_1.1": {"version_key": "wiley-tdm-1.1", "name_key": "wiley-tdm", "family_key": "publisher-tdm"},
"http://onlinelibrary.wiley.com/termsAndConditions#vor": {"version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary"},
"http://onlinelibrary.wiley.com/termsAndConditions#am": {"version_key": "wiley-am", "name_key": "wiley-am", "family_key": "publisher-proprietary"},
"https://onlinelibrary.wiley.com/termsandconditions#vor": {"version_key": "wiley-vor", "name_key": "wiley-vor", "family_key": "publisher-proprietary"},
"https://onlinelibrary.wiley.com/termsandconditions#am": {"version_key": "wiley-am", "name_key": "wiley-am", "family_key": "publisher-proprietary"},
"https://onlinelibrary.wiley.com/termsandconditions": {"version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary"},
"https://onlinelibrary.wiley.com/terms-and-conditions": {"version_key": "wiley-terms", "name_key": "wiley-terms", "family_key": "publisher-proprietary"},
"https://www.springer.com/tdm": {"version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"},
"http://www.springer.com/tdm": {"version_key": "springer-tdm", "name_key": "springer-tdm", "family_key": "publisher-tdm"},
"https://www.springernature.com/gp/researchers/text-and-data-mining": {"version_key": "springernature-tdm", "name_key": "springernature-tdm", "family_key": "publisher-tdm"},
"https://www.tandfonline.com/action/showCopyRight": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},
"https://www.tandfonline.com/action/showcopyright": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},
"https://tandfonline.com/action/showcopyright": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},
"https://www.tandfonline.com/action/showcopyright?show=full": {"version_key": "tandf-terms", "name_key": "tandf-terms", "family_key": "publisher-proprietary"},
"https://us.sagepub.com/en-us/nam/journals-permissions": {"version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary"},
"https://www.sagepub.com/journalspermissions.nav": {"version_key": "sage-permissions", "name_key": "sage-permissions", "family_key": "publisher-proprietary"},
"https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html": {"version_key": "acs-authorchoice-ccby", "name_key": "acs-authorchoice-ccby", "family_key": "publisher-oa"},
"https://pubs.acs.org/page/policy/authorchoice_ccbyncnd_termsofuse.html": {"version_key": "acs-authorchoice-ccbyncnd", "name_key": "acs-authorchoice-ccbyncnd", "family_key": "publisher-oa"},
"https://pubs.acs.org/page/policy/authorchoice_termsofuse.html": {"version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"},
"https://pubs.acs.org/page/policy/acs_authorchoice_with_nih_addendum_termsofuse.html": {"version_key": "acs-authorchoice-nih", "name_key": "acs-authorchoice-nih", "family_key": "publisher-oa"},
"https://doi.org/10.1021/policy/oa-license": {"version_key": "acs-authorchoice", "name_key": "acs-authorchoice", "family_key": "publisher-oa"},
"https://www.rsc.org/journals-books-databases/journal-authors-reviewers/licences-copyright-permissions/": {"version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary"},
"https://www.rsc.org/help/disclaimer/pages/term3.aspx": {"version_key": "rsc-terms", "name_key": "rsc-terms", "family_key": "publisher-proprietary"},
"https://iopscience.iop.org/info/page/text-and-data-mining": {"version_key": "iop-tdm", "name_key": "iop-tdm", "family_key": "publisher-tdm"},
"https://iopscience.iop.org/page/copyright": {"version_key": "iop-copyright", "name_key": "iop-copyright", "family_key": "publisher-proprietary"},
"https://www.bmj.com/company/legal-stuff/copyright-notice/": {"version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary"},
"https://group.bmj.com/group/rights-licensing/permissions": {"version_key": "bmj-copyright", "name_key": "bmj-copyright", "family_key": "publisher-proprietary"},
"https://www.science.org/content/page/science-licenses-journal-article-reuse": {"version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary"},
"https://www.sciencemag.org/about/science-licenses-journal-article-reuse": {"version_key": "aaas-author-reuse", "name_key": "aaas-author-reuse", "family_key": "publisher-proprietary"},
"https://www.pnas.org/site/aboutpnas/licenses.xhtml": {"version_key": "pnas-licenses", "name_key": "pnas-licenses", "family_key": "publisher-proprietary"},
"https://link.aps.org/licenses/aps-default-license": {"version_key": "aps-default", "name_key": "aps-default", "family_key": "publisher-proprietary"},
"https://link.aps.org/licenses/aps-default-text-mining-license": {"version_key": "aps-tdm", "name_key": "aps-tdm", "family_key": "publisher-tdm"},
"https://www.cambridge.org/core/terms": {"version_key": "cup-terms", "name_key": "cup-terms", "family_key": "publisher-proprietary"},
"https://publishing.aip.org/authors/rights-and-permissions": {"version_key": "aip-rights", "name_key": "aip-rights", "family_key": "publisher-proprietary"},
"https://jamanetwork.com/pages/cc-by-license-permissions": {"version_key": "jama-cc-by", "name_key": "jama-cc-by", "family_key": "publisher-oa"},
"https://www.degruyter.com/dg/page/496": {"version_key": "degruyter-terms", "name_key": "degruyter-terms", "family_key": "publisher-proprietary"},
"https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model": {"version_key": "oup-chorus", "name_key": "oup-chorus", "family_key": "publisher-oa"},
"https://academic.oup.com/pages/standard-publication-reuse-rights": {"version_key": "oup-terms", "name_key": "oup-terms", "family_key": "publisher-proprietary"},
"https://opendatacommons.org/licenses/odbl/1-0/": {"version_key": "odbl", "name_key": "odbl", "family_key": "open-data"},
"https://opendatacommons.org/licenses/by/1-0/": {"version_key": "odc-by", "name_key": "odc-by", "family_key": "open-data"},
"https://opendatacommons.org/licenses/pddl/1-0/": {"version_key": "pddl", "name_key": "pddl", "family_key": "open-data"}
}
src/licence_normaliser/defaults.py
"""Default plugin configuration.
These are the plugin CLASSES (not instances) that form the sane defaults.
Pass them to LicenseNormaliser - they're instantiated lazily.
"""
from __future__ import annotations
from collections.abc import Mapping
from typing import Iterator
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
"DEFAULT_PLUGINS",
"DEFAULT_PLUGIN_KEYS",
"get_all_refreshable_plugins",
)
DEFAULT_PLUGIN_KEYS = ("registry", "url", "alias", "family", "name", "prose")
def get_all_refreshable_plugins() -> list[type]:
"""Return all plugin classes that support refresh (have url set)."""
from .parsers.creativecommons import CreativeCommonsParser
from .parsers.opendefinition import OpenDefinitionParser
from .parsers.osi import OSIParser
from .parsers.scancode_licensedb import ScanCodeLicenseDBParser
from .parsers.spdx import SPDXParser
return [
SPDXParser,
OpenDefinitionParser,
OSIParser,
ScanCodeLicenseDBParser,
CreativeCommonsParser,
]
def _load_registry_plugins() -> list[type]:
from .parsers.creativecommons import CreativeCommonsParser
from .parsers.opendefinition import OpenDefinitionParser
from .parsers.osi import OSIParser
from .parsers.scancode_licensedb import ScanCodeLicenseDBParser
from .parsers.spdx import SPDXParser
return [
SPDXParser,
OpenDefinitionParser,
OSIParser,
ScanCodeLicenseDBParser,
CreativeCommonsParser,
]
def _load_url_plugins() -> list[type]:
from .parsers.creativecommons import CreativeCommonsParser
from .parsers.opendefinition import OpenDefinitionParser
from .parsers.osi import OSIParser
from .parsers.publisher import PublisherParser
from .parsers.spdx import SPDXParser
return [
SPDXParser,
OpenDefinitionParser,
OSIParser,
CreativeCommonsParser,
PublisherParser,
]
def _load_alias_plugins() -> list[type]:
from .parsers.alias import AliasParser
from .parsers.publisher import PublisherParser
# PublisherParser first, then AliasParser - AliasParser values take precedence
return [PublisherParser, AliasParser]
def _load_family_plugins() -> list[type]:
from .parsers.alias import AliasParser
return [AliasParser]
def _load_name_plugins() -> list[type]:
from .parsers.alias import AliasParser
return [AliasParser]
def _load_prose_plugins() -> list[type]:
from .parsers.prose import ProseParser
return [ProseParser]
# Lazy-loaded bundle - functions delay imports until actually needed
class _LazyDefaults:
"""Lazy-loading container for default plugins."""
_registry: list[type] | None = None
_url: list[type] | None = None
_alias: list[type] | None = None
_family: list[type] | None = None
_name: list[type] | None = None
_prose: list[type] | None = None
@property
def registry(self) -> list[type]:
if self._registry is None:
self._registry = _load_registry_plugins()
return self._registry
@property
def url(self) -> list[type]:
if self._url is None:
self._url = _load_url_plugins()
return self._url
@property
def alias(self) -> list[type]:
if self._alias is None:
self._alias = _load_alias_plugins()
return self._alias
@property
def family(self) -> list[type]:
if self._family is None:
self._family = _load_family_plugins()
return self._family
@property
def name(self) -> list[type]:
if self._name is None:
self._name = _load_name_plugins()
return self._name
@property
def prose(self) -> list[type]:
if self._prose is None:
self._prose = _load_prose_plugins()
return self._prose
_LAZY = _LazyDefaults()
# Convenience accessors - these trigger lazy loading
def get_default_registry() -> list[type]:
return _LAZY.registry
def get_default_url() -> list[type]:
return _LAZY.url
def get_default_alias() -> list[type]:
return _LAZY.alias
def get_default_family() -> list[type]:
return _LAZY.family
def get_default_name() -> list[type]:
return _LAZY.name
def get_default_prose() -> list[type]:
return _LAZY.prose
class _LazyPluginsBundle:
"""Lazy-loading bundle - defers plugin loading until accessed."""
_cache: dict[str, list[type]] = {}
def _get_registry(self) -> list[type]:
return get_default_registry()
def _get_url(self) -> list[type]:
return get_default_url()
def _get_alias(self) -> list[type]:
return get_default_alias()
def _get_family(self) -> list[type]:
return get_default_family()
def _get_name(self) -> list[type]:
return get_default_name()
def _get_prose(self) -> list[type]:
return get_default_prose()
def __getitem__(self, key: str) -> list[type]:
if key not in self._cache:
fn = getattr(self, f"_get_{key}", None)
if fn is None:
raise KeyError(key)
self._cache[key] = fn()
return self._cache[key]
_DEFAULT_PLUGINS_BUNDLE = _LazyPluginsBundle()
class _DefaultPlugins(Mapping):
"""Lazy dict-like accessor for default plugins."""
def __getitem__(self, key: str) -> list[type]:
return _DEFAULT_PLUGINS_BUNDLE[key]
def keys(self) -> tuple[str, ...]:
return DEFAULT_PLUGIN_KEYS
def values(self) -> list[list[type]]:
return [self[k] for k in self.keys()]
def items(self) -> list[tuple[str, list[type]]]:
return [(k, self[k]) for k in self.keys()]
def __iter__(self) -> Iterator[str]:
return iter(self.keys())
def __len__(self) -> int:
return len(DEFAULT_PLUGIN_KEYS)
def __contains__(self, key: str) -> bool:
return key in self.keys()
def copy(self) -> dict:
return dict(self.items())
DEFAULT_PLUGINS = _DefaultPlugins()
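The lazy-loading idea behind ``DEFAULT_PLUGINS`` can be sketched in isolation. This is a minimal illustration, not the package's implementation: ``LazyPluginMap`` and ``load_alias`` are hypothetical names, and the loader returns a placeholder class instead of a real parser.

```python
from collections.abc import Callable, Iterator, Mapping


class LazyPluginMap(Mapping):
    """Hypothetical stand-in for a lazily populated plugin mapping."""

    def __init__(self, loaders: dict[str, Callable[[], list[type]]]) -> None:
        # Each loader is a zero-argument callable returning plugin classes.
        self._loaders = dict(loaders)
        self._cache: dict[str, list[type]] = {}

    def __getitem__(self, key: str) -> list[type]:
        if key not in self._cache:
            # An unknown key raises KeyError from the loaders lookup.
            self._cache[key] = self._loaders[key]()
        return self._cache[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._loaders)

    def __len__(self) -> int:
        return len(self._loaders)


calls: list[str] = []


def load_alias() -> list[type]:
    # Record each invocation so the caching behaviour is observable.
    calls.append("alias")
    return [dict]  # placeholder class, not a real parser


plugins = LazyPluginMap({"alias": load_alias})
loaded_before_access = list(calls)  # still empty: nothing loaded yet
first = plugins["alias"]
second = plugins["alias"]  # served from the cache, loader not called again
```

Deriving ``__len__`` and ``__iter__`` from the loader table keeps the mapping consistent if a key is ever added or removed.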
src/licence_normaliser/exceptions.py
"""licence_normaliser.exceptions - public exception hierarchy.
These are the only exceptions that cross the public API boundary.
All internal errors are wrapped before propagation.
"""
from __future__ import annotations
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
"DataSourceError",
"LicenseNormalisationError",
"LicenseNormaliserError",
"LicenseNotFoundError",
)
class LicenseNormaliserError(Exception):
"""Base exception for all licence-normaliser errors."""
class LicenseNotFoundError(LicenseNormaliserError):
"""Raised in strict mode when a license string cannot be resolved."""
def __init__(self, raw: str, cleaned: str) -> None:
self.raw = raw
self.cleaned = cleaned
super().__init__(
f"License not found: {raw!r} (cleaned: {cleaned!r}). "
"Pass strict=False to return an 'unknown' result instead."
)
class DataSourceError(LicenseNormaliserError):
"""Raised when a data source file cannot be loaded or parsed."""
class LicenseNormalisationError(ValueError):
"""Raised when ``strict=True`` and no canonical license could be resolved."""
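How a caller might handle strict-mode failures can be sketched as follows. The ``resolve`` function and the ``KNOWN`` table are hypothetical and exist only for illustration; the exception class is a local stand-in that mirrors the ``raw`` and ``cleaned`` attributes of ``LicenseNotFoundError``.

```python
class LicenseNotFoundError(Exception):
    """Local stand-in with the same constructor arguments as the real class."""

    def __init__(self, raw: str, cleaned: str) -> None:
        self.raw = raw
        self.cleaned = cleaned
        super().__init__(f"License not found: {raw!r} (cleaned: {cleaned!r}).")


KNOWN = {"mit": "mit"}  # hypothetical lookup table


def resolve(raw: str, strict: bool = False) -> str:
    """Hypothetical resolver: raise in strict mode, else return 'unknown'."""
    cleaned = raw.strip().lower()
    if cleaned in KNOWN:
        return KNOWN[cleaned]
    if strict:
        raise LicenseNotFoundError(raw, cleaned)
    return "unknown"


try:
    resolve("  Not-A-Licence ", strict=True)
except LicenseNotFoundError as exc:
    # Both the original input and its cleaned form stay available.
    caught = (exc.raw, exc.cleaned)
```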
src/licence_normaliser/parsers/__init__.py
src/licence_normaliser/parsers/alias.py
"""Alias parser - loads aliases.json with rich metadata for aliases/family overrides.
Each entry may carry an optional ``aliases`` list of extra lookup keys that all
resolve to the same ``version_key``. This lets data authors enumerate explicit
variants (e.g. hyphen vs space forms) without any auto-generation magic::
"cc by-nc": {
"version_key": "cc-by-nc",
"name_key": "cc-by-nc",
"family_key": "cc",
"aliases": ["cc-by-nc", "cc by nc", "cc-by nc"]
}
All keys in ``aliases`` inherit the same ``version_key``, ``name_key``, and
``family_key`` as the primary entry.
"""
from __future__ import annotations
import json
from pathlib import Path
from typing import Any
from licence_normaliser.plugins import AliasPlugin, BasePlugin, FamilyPlugin, NamePlugin
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("AliasParser",)
def _iter_entries(
data: dict[str, Any],
) -> list[tuple[str, dict[str, Any]]]:
"""Return (key, meta) pairs, expanding ``aliases`` sub-keys.
For every primary entry that has an ``"aliases"`` list, each alias key is
emitted as an additional entry with the same metadata dict (minus the
``aliases`` field itself, to keep things tidy).
"""
results: list[tuple[str, dict[str, Any]]] = []
for primary_key, meta in data.items():
if primary_key.startswith("_"):
continue
if not isinstance(meta, dict):
continue
version_key = meta.get("version_key", "")
if not version_key:
continue
results.append((primary_key, meta))
# Expand explicit alias variants
for extra_key in meta.get("aliases", []):
if not isinstance(extra_key, str) or not extra_key:
continue
if extra_key == primary_key:
continue # already emitted
# Build a slim copy without the aliases list to avoid recursion
slim_meta = {k: v for k, v in meta.items() if k != "aliases"}
results.append((extra_key, slim_meta))
return results
class AliasParser(BasePlugin, AliasPlugin, FamilyPlugin, NamePlugin):
url = None
local_path = "data/aliases/aliases.json"
def _load_data(self) -> dict[str, Any]:
path = Path(__file__).parent.parent / self.local_path
return json.loads(path.read_text(encoding="utf-8"))
def parse(self) -> list[tuple[str, dict[str, Any]]]:
return _iter_entries(self._load_data())
def load_aliases(self) -> dict[str, str]:
aliases: dict[str, str] = {}
for alias_key, meta in _iter_entries(self._load_data()):
version_key = meta.get("version_key", "")
if version_key:
aliases[alias_key] = version_key
return aliases
def load_aliases_with_lines(
self,
) -> dict[str, tuple[str, int]]:
"""Load aliases with their source line numbers.
Extra keys from ``aliases`` lists are reported at the line of their
primary entry (best approximation without per-alias line tracking).
Returns:
dict mapping alias_key -> (version_key, line_number)
"""
path = Path(__file__).parent.parent / self.local_path
content = path.read_text(encoding="utf-8")
data: dict[str, Any] = json.loads(content)
lines = content.splitlines()
result: dict[str, tuple[str, int]] = {}
for primary_key, meta in data.items():
if primary_key.startswith("_"):
continue
if not isinstance(meta, dict):
continue
version_key = meta.get("version_key", "")
if not version_key:
continue
# Find line of the primary key
primary_line = 1
for i, line in enumerate(lines, start=1):
if f'"{primary_key}"' in line:
primary_line = i
break
result[primary_key] = (version_key, primary_line)
for extra_key in meta.get("aliases", []):
if not isinstance(extra_key, str) or not extra_key:
continue
if extra_key == primary_key:
continue
result[extra_key] = (version_key, primary_line)
return result
def load_families(self) -> dict[str, str]:
data = self._load_data()
overrides: dict[str, str] = {}
for meta in data.values():
if not isinstance(meta, dict):
continue
vk = meta.get("version_key", "")
fk = meta.get("family_key", "")
if vk and fk:
overrides[vk] = fk
return overrides
def load_names(self) -> dict[str, str]:
data = self._load_data()
names: dict[str, str] = {}
for meta in data.values():
if not isinstance(meta, dict):
continue
vk = meta.get("version_key", "")
nk = meta.get("name_key", "")
if vk and nk:
names[vk] = nk
return names
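The alias expansion described in the module docstring can be illustrated with a compact, standalone sketch. ``expand_aliases`` is a hypothetical helper that flattens the data straight into an alias-to-``version_key`` lookup; the sample entry is the one from the docstring plus an assumed ``_comment`` key.

```python
from typing import Any


def expand_aliases(data: dict[str, Any]) -> dict[str, str]:
    """Map every primary key and every listed alias to its version_key."""
    lookup: dict[str, str] = {}
    for primary_key, meta in data.items():
        # Skip underscore-prefixed metadata keys and malformed entries.
        if primary_key.startswith("_") or not isinstance(meta, dict):
            continue
        version_key = meta.get("version_key", "")
        if not version_key:
            continue
        lookup[primary_key] = version_key
        for extra_key in meta.get("aliases", []):
            if isinstance(extra_key, str) and extra_key:
                lookup[extra_key] = version_key
    return lookup


sample = {
    "_comment": "ignored",  # assumed metadata key, for illustration only
    "cc by-nc": {
        "version_key": "cc-by-nc",
        "name_key": "cc-by-nc",
        "family_key": "cc",
        "aliases": ["cc-by-nc", "cc by nc", "cc-by nc"],
    },
}
lookup = expand_aliases(sample)
```

All four spellings end up pointing at the same ``version_key``, which is the behaviour the explicit ``aliases`` list is meant to provide.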
src/licence_normaliser/parsers/creativecommons.py
"""Creative Commons parser - scrapes creativecommons.org for multilingual deed URLs."""
from __future__ import annotations
import json
import re
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from typing import Any
from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin
CC_LICENSE_RE = re.compile(
r"^(by|by-nc|by-nc-nd|by-nc-sa|by-nd|by-sa|"
r"zero|pdmark|devnations|"
r"nc|nd|sa|sampling|nc-sa|sampling\+|nc-sampling\+|nd-nc)"
r"/([\d.]+)"
r"(/igo)?"
r"(/deed\.\w+)?$",
)
VERSION_RE = re.compile(r"^[\d.]+$")
def _path_to_license_key(path: str) -> str | None:
m = CC_LICENSE_RE.match(path)
if not m:
return None
lic_type, version, igo = m.group(1), m.group(2), m.group(3)
prefix_map = {
"by": "cc-by",
"by-nc": "cc-by-nc",
"by-nc-nd": "cc-by-nc-nd",
"by-nc-sa": "cc-by-nc-sa",
"by-nd": "cc-by-nd",
"by-sa": "cc-by-sa",
"zero": "cc0",
"pdmark": "cc-pdm",
"devnations": "cc-devnations",
"nc": "cc-nc",
"nd": "cc-nd",
"sa": "cc-sa",
"sampling": "cc-sampling",
"nc-sa": "cc-nc-sa",
"sampling+": "cc-sampling-plus",
"nc-sampling+": "cc-nc-sampling-plus",
"nd-nc": "cc-nd-nc",
}
prefix = prefix_map.get(lic_type)
if not prefix:
return None
suffix = "igo" if igo else ""
key = f"{prefix}-{version}" if VERSION_RE.match(version) else prefix
if suffix:
key = f"{key}-{suffix}"
return key.lower()
class CCLinkParser(HTMLParser):
def __init__(self) -> None:
super().__init__()
self.in_td = False
self.current_cell = ""
self.current_row: list[str] = []
self.rows: list[list[str]] = []
def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
if tag == "td":
self.in_td = True
self.current_cell = ""
elif tag == "a" and self.in_td:
href = dict(attrs).get("href") or ""
if href:
self.current_cell += " AHREF:" + href
def handle_endtag(self, tag: str) -> None:
if tag == "td":
self.in_td = False
self.current_row.append(self.current_cell.strip())
elif tag == "tr":
if self.current_row:
self.rows.append(self.current_row)
self.current_row = []
def handle_data(self, data: str) -> None:
if self.in_td:
self.current_cell += data
def _fetch_html(url: str) -> str:
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req, timeout=30) as response: # noqa: S310
return response.read().decode("utf-8")
JURISDICTION_CODES = {
"au",
"at",
"be",
"br",
"ca",
"ch",
"cl",
"cn",
"co",
"cz",
"de",
"dk",
"ee",
"eg",
"es",
"fi",
"fr",
"gb",
"gr",
"hr",
"hu",
"id",
"ie",
"il",
"in",
"ir",
"is",
"it",
"jp",
"kr",
"lt",
"lu",
"lv",
"ma",
"mt",
"mx",
"my",
"nl",
"no",
"nz",
"pe",
"ph",
"pl",
"pt",
"ro",
"rs",
"ru",
"se",
"si",
"sk",
"th",
"tr",
"tw",
"ua",
"ug",
"us",
"za",
"vn",
}
def _is_international(href: str) -> bool:
parts = href.split("/")
return not any(p in JURISDICTION_CODES for p in parts[1:])
def _extract_deeds(html: str) -> set[str]:
parser = CCLinkParser()
parser.feed(html)
deeds: set[str] = set()
for row in parser.rows:
if not row:
continue
jurisdiction = row[0]
if jurisdiction != "English":
continue
for cell in row[1:]:
for part in cell.split():
if part.startswith("AHREF:"):
href = part[6:]
if href and _is_international(href):
deeds.add(href)
return deeds
def _scrape() -> list[dict[str, str]]:
pages = [
"https://creativecommons.org/licenses/list.en",
"https://creativecommons.org/publicdomain/list.en",
]
all_deeds: set[str] = set()
try:
for page_url in pages:
html = _fetch_html(page_url)
all_deeds |= _extract_deeds(html)
except Exception:
pass
entries: list[dict[str, str]] = []
seen_keys: set[str] = set()
for href in sorted(all_deeds):
lic_key = _path_to_license_key(href)
if not lic_key:
continue
url_path = href.rsplit("/deed.", 1)[0]
url = f"https://creativecommons.org/licenses/{url_path}/"
if lic_key in seen_keys:
continue
seen_keys.add(lic_key)
entries.append({"license_key": lic_key, "url": url, "path": url_path})
return entries
class CreativeCommonsParser(BasePlugin, RegistryPlugin, URLPlugin):
id = "creativecommons"
url = "https://creativecommons.org/licenses/list.en"
local_path = "data/creativecommons/creativecommons.json"
def parse(self) -> list[tuple[str, dict[str, Any]]]:
path = Path(__file__).parent.parent / self.local_path
if not path.exists():
return []
data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
return [
(
entry["license_key"],
{
"url": entry["url"],
"name": entry["license_key"],
"path": entry["path"],
},
)
for entry in data
if "license_key" in entry
]
def load_registry(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
if not path.exists():
return {}
data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
result: dict[str, str] = {}
for entry in data:
key = entry.get("license_key", "")
if key:
result[key.lower().strip()] = key.lower().strip()
return result
def load_urls(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
if not path.exists():
return {}
data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
result: dict[str, str] = {}
for entry in data:
key = entry.get("license_key", "")
if not key:
continue
canonical = key.lower().strip()
raw_url = entry.get("url", "")
if not raw_url:
continue
clean = raw_url.strip().lower().rstrip("/")
if clean.startswith("http://"):
clean = "https://" + clean[7:]
result[clean] = canonical
return result
@classmethod
def refresh(cls, force: bool = False) -> bool:
target = Path(__file__).parent.parent / cls.local_path
if target.exists() and not force:
return True
try:
data = _scrape()
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text(
json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
)
return True
except Exception:
return False
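The mapping from a Creative Commons URL path to a licence key can be sketched with a reduced pattern. This is a simplified illustration: ``SIMPLE_CC_RE``, ``PREFIXES`` and ``path_to_key`` are hypothetical names, and only a handful of licence types are covered.

```python
import re
from typing import Optional

# Reduced pattern: type / version / optional "igo" / optional deed suffix.
SIMPLE_CC_RE = re.compile(
    r"^(by-nc-nd|by-nc-sa|by-nc|by-nd|by-sa|by|zero)"
    r"/([\d.]+)"
    r"(/igo)?"
    r"(/deed\.\w+)?$"
)

PREFIXES = {
    "by": "cc-by",
    "by-nc": "cc-by-nc",
    "by-nc-nd": "cc-by-nc-nd",
    "by-nc-sa": "cc-by-nc-sa",
    "by-nd": "cc-by-nd",
    "by-sa": "cc-by-sa",
    "zero": "cc0",
}


def path_to_key(path: str) -> Optional[str]:
    """Return a key such as 'cc-by-nc-4.0', or None if the path is not a deed."""
    m = SIMPLE_CC_RE.match(path)
    if not m:
        return None
    lic_type, version, igo = m.group(1), m.group(2), m.group(3)
    key = f"{PREFIXES[lic_type]}-{version}"
    if igo:
        key += "-igo"
    return key


keys = [path_to_key(p) for p in ("by-nc/4.0/deed.en", "by/3.0/igo", "zero/1.0")]
```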
src/licence_normaliser/parsers/opendefinition.py
"""OpenDefinition parser - loads opendefinition.json from package data."""
from __future__ import annotations
import json
from pathlib import Path
from typing import Any
from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("OpenDefinitionParser",)
class OpenDefinitionParser(BasePlugin, RegistryPlugin, URLPlugin):
id = "opendefinition"
url = "https://licenses.opendefinition.org/licenses/groups/all.json"
local_path = "data/opendefinition/opendefinition.json"
def parse(self) -> list[tuple[str, dict[str, Any]]]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
results: list[tuple[str, dict[str, Any]]] = []
for entry in data.values():
if not isinstance(entry, dict):
continue
lid = entry.get("id", "")
url = entry.get("url", "")
results.append((lid, {"url": url, "title": entry.get("title", "")}))
return results
def load_registry(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
result: dict[str, str] = {}
for entry in data.values():
if not isinstance(entry, dict):
continue
lid = entry.get("id", "")
if lid:
result[lid.lower().strip()] = lid.lower().strip()
return result
def load_urls(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
result: dict[str, str] = {}
for entry in data.values():
if not isinstance(entry, dict):
continue
lid = entry.get("id", "")
if not lid:
continue
canonical = lid.lower().strip()
raw_url = entry.get("url", "")
if not raw_url:
continue
clean = raw_url.strip().lower().rstrip("/")
if clean.startswith("http://"):
clean = "https://" + clean[7:]
result[clean] = canonical
return result
src/licence_normaliser/parsers/osi.py
"""OSI parser - loads osi.json from package data."""
from __future__ import annotations
import json
from pathlib import Path
from typing import Any
from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("OSIParser",)
class OSIParser(BasePlugin, RegistryPlugin, URLPlugin):
id = "osi"
url = "https://opensource.org/api/license"
local_path = "data/osi/osi.json"
def parse(self) -> list[tuple[str, dict[str, Any]]]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
results: list[tuple[str, dict[str, Any]]] = []
if not isinstance(data, list):
return results
for entry in data:
if not isinstance(entry, dict):
continue
key = entry.get("id", "")
if not key:
continue
links = entry.get("_links", {})
html_link = links.get("html", {})
url = html_link.get("href", "") if isinstance(html_link, dict) else ""
results.append(
(
key,
{
"url": url,
"name": entry.get("name", ""),
"spdx_id": entry.get("spdx_id", ""),
},
)
)
return results
def load_registry(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
result: dict[str, str] = {}
if not isinstance(data, list):
return result
for entry in data:
if not isinstance(entry, dict):
continue
key = entry.get("id", "").strip()
if key:
result[key.lower()] = key.lower()
return result
def load_urls(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
result: dict[str, str] = {}
if not isinstance(data, list):
return result
for entry in data:
if not isinstance(entry, dict):
continue
key = entry.get("id", "").strip()
if not key:
continue
canonical = key.lower()
links = entry.get("_links", {})
html_link = links.get("html", {})
raw_url = html_link.get("href", "") if isinstance(html_link, dict) else ""
if not raw_url:
continue
clean = raw_url.strip().lower().rstrip("/")
if clean.startswith("http://"):
clean = "https://" + clean[7:]
result[clean] = canonical
return result
src/licence_normaliser/parsers/prose.py
"""Prose pattern parser - loads prose_patterns.json and compiles regex patterns."""
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Any
from licence_normaliser.plugins import BasePlugin, ProsePlugin
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("ProseParser",)
_COMPILED_PATTERNS: list[tuple[re.Pattern[str], str]] = []
class ProseParser(BasePlugin, ProsePlugin):
is_registry_entry = False
url = None
local_path = "data/prose/prose_patterns.json"
def parse(self) -> list[tuple[str, dict[str, Any]]]:
path = Path(__file__).parent.parent / self.local_path
data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
global _COMPILED_PATTERNS
_COMPILED_PATTERNS = []
results: list[tuple[str, dict[str, Any]]] = []
for entry in data:
pattern_str = entry.get("pattern", "")
version_key = entry.get("version_key", "")
name_key = entry.get("name_key", "")
family_key = entry.get("family_key", "")
if pattern_str and version_key:
compiled = re.compile(pattern_str, re.IGNORECASE)
_COMPILED_PATTERNS.append((compiled, version_key))
results.append(
(
pattern_str,
{
"pattern": compiled,
"version_key": version_key,
"name_key": name_key,
"family_key": family_key,
},
)
)
return results
def load_prose(self) -> list[tuple[re.Pattern[str], str]]:
global _COMPILED_PATTERNS
_COMPILED_PATTERNS = []
path = Path(__file__).parent.parent / self.local_path
data: list[dict[str, str]] = json.loads(path.read_text(encoding="utf-8"))
for entry in data:
pattern_str = entry.get("pattern", "")
version_key = entry.get("version_key", "")
if pattern_str and version_key:
compiled = re.compile(pattern_str, re.IGNORECASE)
_COMPILED_PATTERNS.append((compiled, version_key))
return _COMPILED_PATTERNS
def load_prose_with_lines(self) -> list[tuple[re.Pattern[str], str, int]]:
"""Load prose patterns with their source line numbers.
Returns:
list of (compiled_pattern, version_key, line_number)
"""
path = Path(__file__).parent.parent / self.local_path
content = path.read_text(encoding="utf-8")
data: list[dict[str, str]] = json.loads(content)
lines = content.splitlines()
result: list[tuple[re.Pattern[str], str, int]] = []
for entry in data:
pattern_str = entry.get("pattern", "")
version_key = entry.get("version_key", "")
if pattern_str and version_key:
compiled = re.compile(pattern_str, re.IGNORECASE)
serialized = json.dumps(pattern_str)
line_num = 1
for i, line in enumerate(lines, start=1):
if '"pattern"' in line and serialized[:30] in line:
line_num = i
break
result.append((compiled, version_key, line_num))
return result
def get_prose_patterns() -> list[tuple[re.Pattern[str], str]]:
"""Legacy helper: return the compiled prose patterns."""
return _COMPILED_PATTERNS
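How compiled prose patterns might be applied to free text can be sketched as below. The entries are invented for illustration, and using ``search`` with first-match-wins ordering is an assumption: the matching step itself is not shown in this module.

```python
import re
from typing import Optional

# Hypothetical entries in the same shape as prose_patterns.json items.
entries = [
    {"pattern": r"creative commons attribution", "version_key": "cc-by"},
    {"pattern": "", "version_key": "skipped-because-pattern-is-empty"},
]

# Compile case-insensitively, skipping entries without a pattern or key.
compiled = [
    (re.compile(e["pattern"], re.IGNORECASE), e["version_key"])
    for e in entries
    if e.get("pattern") and e.get("version_key")
]


def match_prose(text: str) -> Optional[str]:
    """Return the version_key of the first pattern found in text, if any."""
    for pattern, version_key in compiled:
        if pattern.search(text):
            return version_key
    return None


hit = match_prose("Licensed under the Creative Commons Attribution licence.")
miss = match_prose("All rights reserved.")
```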
src/licence_normaliser/parsers/publisher.py
"""Publisher parser - loads publishers.json with URLs and shorthand aliases."""
from __future__ import annotations
import json
from pathlib import Path
from typing import Any
from licence_normaliser.plugins import AliasPlugin, BasePlugin, URLPlugin
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("PublisherParser",)
class PublisherParser(BasePlugin, AliasPlugin, URLPlugin):
url = None
local_path = "data/publishers/publishers.json"
def parse(self) -> list[tuple[str, dict[str, Any]]]:
path = Path(__file__).parent.parent / self.local_path
data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8"))
results: list[tuple[str, dict[str, Any]]] = []
urls: dict[str, dict[str, str]] = data.get("urls", {})
for url, meta in urls.items():
if isinstance(meta, dict):
results.append((url, meta))
return results
def load_aliases(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8"))
aliases: dict[str, str] = data.get("shorthand_aliases", {})
return dict(aliases)
def load_aliases_with_lines(self) -> dict[str, tuple[str, int]]:
"""Load shorthand aliases with their source line numbers."""
path = Path(__file__).parent.parent / self.local_path
content = path.read_text(encoding="utf-8")
data: dict[str, Any] = json.loads(content)
lines = content.splitlines()
result: dict[str, tuple[str, int]] = {}
for alias_key, version_key in data.get("shorthand_aliases", {}).items():
for i, line in enumerate(lines, start=1):
if f'"{alias_key}"' in line:
result[alias_key] = (version_key, i)
break
return result
def load_urls(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
data: dict[str, Any] = json.loads(path.read_text(encoding="utf-8"))
result: dict[str, str] = {}
for url, meta in data.get("urls", {}).items():
if not isinstance(meta, dict):
continue
vk = meta.get("version_key", "")
if not vk:
continue
clean = url.strip().lower().rstrip("/")
if clean.startswith("http://"):
clean = "https://" + clean[7:]
result[clean] = vk
return result
def load_urls_with_lines(self) -> dict[str, tuple[str, int]]:
"""Load URLs with their source line numbers."""
path = Path(__file__).parent.parent / self.local_path
content = path.read_text(encoding="utf-8")
data: dict[str, Any] = json.loads(content)
lines = content.splitlines()
result: dict[str, tuple[str, int]] = {}
for url, meta in data.get("urls", {}).items():
if not isinstance(meta, dict):
continue
vk = meta.get("version_key", "")
if not vk:
continue
clean = url.strip().lower().rstrip("/")
if clean.startswith("http://"):
clean = "https://" + clean[7:]
for i, line in enumerate(lines, start=1):
if f'"{url}"' in line or f'"{clean}"' in line:
result[clean] = (vk, i)
break
return result
src/licence_normaliser/parsers/scancode_licensedb.py
"""ScanCode-licensedb parser - loads scancode_licensedb.json from package data."""
from __future__ import annotations
import json
from pathlib import Path
from typing import Any
from licence_normaliser.plugins import BasePlugin, RegistryPlugin
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("ScanCodeLicenseDBParser",)
class ScanCodeLicenseDBParser(BasePlugin, RegistryPlugin):
id = "scancode-licensedb"
url = "https://scancode-licensedb.aboutcode.org/index.json"
local_path = "data/scancode_licensedb/scancode_licensedb.json"
def parse(self) -> list[tuple[str, dict[str, Any]]]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
results: list[tuple[str, dict[str, Any]]] = []
if not isinstance(data, list):
return results
for entry in data:
if not isinstance(entry, dict):
continue
key = entry.get("license_key", "")
if not key:
continue
if key.lower() == "unknown":
continue
spdx_key = entry.get("spdx_license_key")
category = entry.get("category", "")
results.append(
(
key,
{
"url": "",
"name": key,
"category": category,
"spdx_license_key": spdx_key if spdx_key else "",
},
)
)
return results
def load_registry(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
result: dict[str, str] = {}
if not isinstance(data, list):
return result
for entry in data:
if not isinstance(entry, dict):
continue
key = entry.get("license_key", "")
if key and key.lower() != "unknown":
result[key.lower().strip()] = key.lower().strip()
return result
src/licence_normaliser/parsers/spdx.py
"""SPDX parser - loads spdx.json from package data."""
from __future__ import annotations
import json
from pathlib import Path
from typing import Any
from licence_normaliser.plugins import BasePlugin, RegistryPlugin, URLPlugin
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = ("SPDXParser",)
class SPDXParser(BasePlugin, RegistryPlugin, URLPlugin):
id = "spdx"
url = "https://raw.githubusercontent.com/spdx/license-list-data/main/json/licenses.json"
local_path = "data/spdx/spdx.json"
def parse(self) -> list[tuple[str, dict[str, Any]]]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
results: list[tuple[str, dict[str, Any]]] = []
for entry in data.get("licenses", []):
if not isinstance(entry, dict):
continue
lid = entry.get("licenseId", "")
urls = entry.get("seeAlso", [])
url = urls[0] if urls else ""
results.append((lid, {"url": url, "name": entry.get("name", "")}))
return results
def load_registry(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
result: dict[str, str] = {}
for entry in data.get("licenses", []):
if not isinstance(entry, dict):
continue
lid = entry.get("licenseId", "")
if lid:
result[lid.lower().strip()] = lid.lower().strip()
return result
def load_urls(self) -> dict[str, str]:
path = Path(__file__).parent.parent / self.local_path
data = json.loads(path.read_text(encoding="utf-8"))
result: dict[str, str] = {}
for entry in data.get("licenses", []):
if not isinstance(entry, dict):
continue
lid = entry.get("licenseId", "")
if not lid:
continue
canonical = lid.lower().strip()
for raw_url in entry.get("seeAlso", []):
if not raw_url:
continue
clean = raw_url.strip().lower().rstrip("/")
if clean.startswith("http://"):
clean = "https://" + clean[7:]
result[clean] = canonical
return result
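The URL clean-up performed inside ``load_urls`` can be read in isolation. The sketch below pulls those steps into a standalone helper; the names ``clean_url`` and ``build_url_map`` and the sample entry are assumptions made for illustration and are not part of the package.

```python
from __future__ import annotations

from typing import Any


def clean_url(raw_url: str) -> str:
    """Normalise a licence URL so that equivalent spellings share one key."""
    # Trim whitespace, lower-case, and drop any trailing slash.
    clean = raw_url.strip().lower().rstrip("/")
    # Treat http:// and https:// as the same resource.
    if clean.startswith("http://"):
        clean = "https://" + clean[len("http://"):]
    return clean


def build_url_map(entries: list[dict[str, Any]]) -> dict[str, str]:
    """Map every cleaned ``seeAlso`` URL to the lower-cased licence id."""
    result: dict[str, str] = {}
    for entry in entries:
        lid = entry.get("licenseId", "")
        if not lid:
            continue
        for raw_url in entry.get("seeAlso", []):
            if raw_url:
                result[clean_url(raw_url)] = lid.lower().strip()
    return result


# Hypothetical sample entry in the shape of the SPDX "licenses" list.
sample = [{"licenseId": "MIT", "seeAlso": ["http://opensource.org/licenses/MIT/"]}]
url_map = build_url_map(sample)
```

Because the whole URL is lower-cased, lookups presumably need to clean the incoming string the same way before consulting the map.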
src/licence_normaliser/plugins.py
"""Simple plugin interface definitions.
Each plugin is a class whose ``load_*`` method returns a dict or a list of tuples.
Plugins are passed as CLASSES (not instances) - they're instantiated lazily.
"""
from __future__ import annotations
import json
import logging
import re
import urllib.error
import urllib.request
from pathlib import Path
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
__all__ = (
"AliasPlugin",
"BasePlugin",
"FamilyPlugin",
"NamePlugin",
"ProsePlugin",
"RegistryPlugin",
"URLPlugin",
)
class BasePlugin:
"""Base class for all plugins with refresh capability."""
url: str | None = None
local_path: str = ""
@classmethod
def refresh(cls, force: bool = False) -> bool:
"""Fetch fresh data from ``cls.url`` and write to ``cls.local_path``.
The local path is resolved relative to the package root
(``src/licence_normaliser/``).
If ``cls.url`` is None, this is a local-only parser with no external
source and the operation succeeds without fetching.
Returns True on success, False on failure.
"""
if not cls.local_path:
return False
target = Path(__file__).parent / cls.local_path
if target.exists() and not force:
return True
if cls.url is None:
return True
try:
with urllib.request.urlopen(cls.url, timeout=30) as response: # noqa: S310
raw_bytes = response.read()
json.loads(raw_bytes.decode("utf-8"))
target.parent.mkdir(parents=True, exist_ok=True)
target.write_bytes(raw_bytes)
return True
except urllib.error.HTTPError as exc:
# HTTPError is a subclass of URLError, so it must be caught first.
logging.warning(
"refresh(%s): HTTPError %s fetching %s", cls.__name__, exc.code, cls.url
)
return False
except urllib.error.URLError as exc:
logging.warning(
"refresh(%s): URLError fetching %s - %s", cls.__name__, cls.url, exc
)
return False
except json.JSONDecodeError as exc:
logging.error(
"refresh(%s): invalid JSON from %s - %s", cls.__name__, cls.url, exc
)
return False
except OSError as exc:
logging.error(
"refresh(%s): OSError writing %s - %s", cls.__name__, target, exc
)
return False
class RegistryPlugin:
"""Returns key -> canonical_key mappings."""
def load_registry(self) -> dict[str, str]:
raise NotImplementedError
class URLPlugin:
"""Returns cleaned_url -> version_key mappings."""
def load_urls(self) -> dict[str, str]:
raise NotImplementedError
class AliasPlugin:
"""Returns alias_string -> version_key mappings."""
def load_aliases(self) -> dict[str, str]:
raise NotImplementedError
class FamilyPlugin:
"""Returns version_key -> family_key mappings."""
def load_families(self) -> dict[str, str]:
raise NotImplementedError
class NamePlugin:
"""Returns version_key -> name_key mappings."""
def load_names(self) -> dict[str, str]:
raise NotImplementedError
class ProsePlugin:
"""Returns list of (compiled_pattern, version_key) for prose matching."""
def load_prose(self) -> list[tuple[re.Pattern[str], str]]:
raise NotImplementedError
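As an illustration of the interface classes above, here is a minimal sketch of a prose plugin and a consumer that instantiates it lazily. ``InMemoryProsePlugin``, ``match_prose`` and the two patterns are hypothetical; ``ProsePlugin`` is repeated as a stand-in so the block runs on its own, and the real normaliser may consume plugins differently.

```python
from __future__ import annotations

import re


class ProsePlugin:
    """Stand-in copy of the interface, repeated to keep the sketch self-contained."""

    def load_prose(self) -> list[tuple[re.Pattern[str], str]]:
        raise NotImplementedError


class InMemoryProsePlugin(ProsePlugin):
    """Hypothetical plugin: patterns are hard-coded instead of read from JSON."""

    _RAW = [
        (r"\bcc\s+by-nc-nd\s+4\.0\b", "cc-by-nc-nd-4.0"),
        (r"\belsevier\s+tdm\b", "elsevier-tdm"),
    ]

    def load_prose(self) -> list[tuple[re.Pattern[str], str]]:
        # Compile once per call; each tuple is (compiled_pattern, version_key).
        return [(re.compile(p, re.IGNORECASE), key) for p, key in self._RAW]


def match_prose(text: str, plugin_cls: type[ProsePlugin]) -> str | None:
    """Return the version key of the first pattern found in ``text``."""
    plugin = plugin_cls()  # the class is passed in and instantiated on use
    for pattern, version_key in plugin.load_prose():
        if pattern.search(text):
            return version_key
    return None
```

Passing the class instead of an instance keeps construction cost out of import time, which matches the "instantiated lazily" note in the module docstring.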
src/licence_normaliser/tests/__init__.py
"""Tests for licence_normaliser."""
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
src/licence_normaliser/tests/conftest.py
"""Shared fixtures for licence_normaliser tests."""
import pytest
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
@pytest.fixture()
def mit_raw() -> str:
return "MIT"
@pytest.fixture()
def cc_by_nc_nd_4_raw() -> str:
return "CC BY-NC-ND 4.0"
@pytest.fixture()
def batch_raw() -> list[str]:
return ["MIT", "Apache-2.0", "CC BY 4.0"]
src/licence_normaliser/tests/test_aliases.py
"""Tests for AliasParser - non-CC aliases (Apache, MIT, BSD, GPL, etc.)."""
from licence_normaliser import normalise_license
class TestNonCCAliases:
def test_apache_shorthand(self):
v = normalise_license("apache")
assert v.key == "apache-2.0"
assert v.family.key == "osi"
def test_apache_license(self):
v = normalise_license("apache license")
assert v.key == "apache-2.0"
assert v.family.key == "osi"
def test_apache_2(self):
v = normalise_license("apache 2")
assert v.key == "apache-2.0"
assert v.family.key == "osi"
def test_apache_2_0(self):
v = normalise_license("apache 2.0")
assert v.key == "apache-2.0"
assert v.family.key == "osi"
def test_mit_license(self):
v = normalise_license("mit license")
assert v.key == "mit"
assert v.family.key == "osi"
def test_the_mit_license(self):
v = normalise_license("the mit license")
assert v.key == "mit"
assert v.family.key == "osi"
def test_bsd_shorthand(self):
v = normalise_license("bsd")
assert v.key == "bsd-3-clause"
assert v.family.key == "osi"
def test_bsd_license(self):
v = normalise_license("bsd license")
assert v.key == "bsd-3-clause"
assert v.family.key == "osi"
def test_mozilla(self):
v = normalise_license("mozilla")
assert v.key == "mpl-2.0"
assert v.family.key == "osi"
def test_isc_license(self):
v = normalise_license("isc license")
assert v.key == "isc"
assert v.family.key == "osi"
def test_gpl_shorthand(self):
v = normalise_license("gpl")
assert v.key == "gpl-3.0"
assert v.family.key == "copyleft"
def test_gnu_gpl(self):
v = normalise_license("gnu gpl")
assert v.key == "gpl-3.0"
assert v.family.key == "copyleft"
def test_gnu_gpl_v2(self):
v = normalise_license("gnu gpl v2")
assert v.key == "gpl-2.0"
assert v.family.key == "copyleft"
def test_gpl_3_0_or_later(self):
v = normalise_license("gpl-3.0+")
assert v.key == "gpl-3.0"
assert v.family.key == "copyleft"
def test_gpl_2_0_or_later(self):
v = normalise_license("gpl-2.0+")
assert v.key == "gpl-2.0"
assert v.family.key == "copyleft"
def test_agpl_shorthand(self):
v = normalise_license("agpl")
assert v.key == "agpl-3.0"
assert v.family.key == "copyleft"
def test_agpl_3_0_or_later(self):
v = normalise_license("agpl-3.0+")
assert v.key == "agpl-3.0"
assert v.family.key == "copyleft"
def test_lgpl_shorthand(self):
v = normalise_license("lgpl")
assert v.key == "lgpl-3.0"
assert v.family.key == "copyleft"
def test_lgpl_2_1_or_later(self):
v = normalise_license("lgpl-2.1+")
assert v.key == "lgpl-2.1"
assert v.family.key == "copyleft"
def test_lgpl_3_0_or_later(self):
v = normalise_license("lgpl-3.0+")
assert v.key == "lgpl-3.0"
assert v.family.key == "copyleft"
def test_unlicense(self):
v = normalise_license("unlicense")
assert v.key == "unlicense"
assert v.family.key == "osi"
def test_wtfpl(self):
v = normalise_license("wtfpl")
assert v.key == "wtfpl"
assert v.family.key == "osi"
def test_zlib(self):
v = normalise_license("zlib")
assert v.key == "zlib"
assert v.family.key == "osi"
def test_open_database_license(self):
v = normalise_license("open database license")
assert v.key == "odbl"
assert v.family.key == "open-data"
def test_public_domain(self):
v = normalise_license("public domain")
assert v.key == "public-domain"
assert v.family.key == "public-domain"
def test_pd_alias(self):
v = normalise_license("pd")
assert v.key == "public-domain"
assert v.family.key == "public-domain"
src/licence_normaliser/tests/test_cache.py
"""Tests for _cache.py - thread-safe default normaliser singleton."""
from __future__ import annotations
import threading
from concurrent.futures import ThreadPoolExecutor
from licence_normaliser._cache import (
_DefaultNormaliser,
get_registry_keys,
normalise_license,
normalise_licenses,
)
from licence_normaliser._normaliser import LicenseNormaliser
class TestDefaultNormaliserSingleton:
def test_singleton_instance_reused(self) -> None:
d1 = _DefaultNormaliser()
d2 = _DefaultNormaliser()
assert d1.get() is d2.get()
def test_get_returns_licence_normaliser(self) -> None:
d = _DefaultNormaliser()
instance = d.get()
assert isinstance(instance, LicenseNormaliser)
def test_thread_safety_same_instance(self) -> None:
results: list[object | None] = [None] * 20
errors: list[BaseException | None] = [None] * 20
def get_instance(idx: int) -> None:
try:
d = _DefaultNormaliser()
results[idx] = d.get()
except BaseException as e: # noqa: BLE001
errors[idx] = e
threads = [threading.Thread(target=get_instance, args=(i,)) for i in range(20)]
for t in threads:
t.start()
for t in threads:
t.join()
assert all(e is None for e in errors)
assert results[0] is not None
assert all(r is results[0] for r in results if r is not None)
def test_concurrent_normalise_license(self) -> None:
licenses = ["MIT", "Apache-2.0", "CC BY 4.0", "GPL-3.0", "BSD-3-Clause"]
def normalise(lic: str) -> str:
v = normalise_license(lic)
return v.key
with ThreadPoolExecutor(max_workers=10) as executor:
futures = [executor.submit(normalise, lic) for lic in licenses * 4]
results = [f.result(timeout=5) for f in futures]
assert len(results) == len(licenses) * 4
assert set(results) == {
"mit",
"apache-2.0",
"cc-by-4.0",
"gpl-3.0",
"bsd-3-clause",
}
class TestModuleLevelAPI:
def test_normalise_license_returns_license_version(self) -> None:
v = normalise_license("MIT")
assert str(v) == "mit"
def test_normalise_licenses_returns_list(self) -> None:
results = normalise_licenses(["MIT", "Apache-2.0"])
assert len(results) == 2
assert all(str(r) in ("mit", "apache-2.0") for r in results)
def test_get_registry_keys_returns_set_of_strings(self) -> None:
keys = get_registry_keys()
assert isinstance(keys, set)
assert len(keys) > 1000
assert "mit" in keys
assert "apache-2.0" in keys
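The singleton tests above expect every holder object to hand back the same instance, even under concurrent access. One common way to get that behaviour is double-checked locking around class-level state. The sketch below shows the pattern with hypothetical names (``Expensive``, ``DefaultHolder``); it is an assumption about how such a cache could work, not a copy of ``_cache.py``.

```python
from __future__ import annotations

import threading


class Expensive:
    """Hypothetical stand-in for an object that is costly to build."""

    created = 0

    def __init__(self) -> None:
        Expensive.created += 1


class DefaultHolder:
    """Every holder shares one lazily built instance stored on the class."""

    _instance: Expensive | None = None
    _lock = threading.Lock()

    def get(self) -> Expensive:
        # Fast path: skip the lock once the instance exists.
        if DefaultHolder._instance is None:
            with DefaultHolder._lock:
                # Re-check inside the lock; another thread may have won the race.
                if DefaultHolder._instance is None:
                    DefaultHolder._instance = Expensive()
        return DefaultHolder._instance


results: list[Expensive | None] = [None] * 20


def worker(idx: int) -> None:
    results[idx] = DefaultHolder().get()


threads = [threading.Thread(target=worker, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The second check inside the lock is what prevents two threads that both saw ``None`` from each building their own instance.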
src/licence_normaliser/tests/test_cli.py
"""Tests for licence_normaliser CLI - includes new --strict flag."""
from unittest.mock import patch
import pytest
from licence_normaliser.cli._main import main
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
class TestNormaliseCommand:
def test_normalise_mit(self, capsys):
with patch("sys.argv", ["licence-normaliser", "normalise", "MIT"]):
with pytest.raises(SystemExit) as exc_info:
main()
assert exc_info.value.code == 0
assert capsys.readouterr().out.strip() == "mit"
def test_normalise_full(self, capsys):
with patch(
"sys.argv", ["licence-normaliser", "normalise", "--full", "CC BY 4.0"]
):
with pytest.raises(SystemExit) as exc_info:
main()
assert exc_info.value.code == 0
out = capsys.readouterr().out
assert "Key: cc-by-4.0" in out
assert "License: cc-by" in out
assert "Family: cc" in out
def test_normalise_cc_url(self, capsys):
with patch(
"sys.argv",
[
"licence-normaliser",
"normalise",
"http://creativecommons.org/licenses/by/4.0/",
],
):
with pytest.raises(SystemExit) as exc_info:
main()
assert exc_info.value.code == 0
assert capsys.readouterr().out.strip() == "cc-by-4.0"
def test_normalise_unknown(self, capsys):
with patch(
"sys.argv", ["licence-normaliser", "normalise", "totally-unknown-xyz"]
):
with pytest.raises(SystemExit) as exc_info:
main()
assert exc_info.value.code == 0
assert "totally-unknown-xyz" in capsys.readouterr().out
def test_normalise_strict_known(self, capsys):
with patch("sys.argv", ["licence-normaliser", "normalise", "--strict", "MIT"]):
with pytest.raises(SystemExit) as exc_info:
main()
assert exc_info.value.code == 0
assert capsys.readouterr().out.strip() == "mit"
def test_normalise_strict_unknown_exits_1(self, capsys):
with patch(
"sys.argv",
["licence-normaliser", "normalise", "--strict", "totally-unknown-xyz-9999"],
):
with pytest.raises(SystemExit) as exc_info:
main()
assert exc_info.value.code == 1
assert capsys.readouterr().err # error message on stderr
class TestBatchCommand:
def test_batch_basic(self, capsys):
with patch(
"sys.argv",
["licence-normaliser", "batch", "MIT", "Apache-2.0", "CC BY 4.0"],
):
with pytest.raises(SystemExit) as exc_info:
main()
assert exc_info.value.code == 0
out = capsys.readouterr().out
assert "MIT: mit" in out
assert "Apache-2.0: apache-2.0" in out
assert "CC BY 4.0: cc-by-4.0" in out
def test_batch_strict_all_known(self, capsys):
with patch(
"sys.argv", ["licence-normaliser", "batch", "--strict", "MIT", "GPL-3.0"]
):
with pytest.raises(SystemExit) as exc_info:
main()
assert exc_info.value.code == 0
def test_batch_strict_with_unknown_exits_1(self, capsys):
with patch(
"sys.argv",
["licence-normaliser", "batch", "--strict", "MIT", "no-such-license-xyz"],
):
with pytest.raises(SystemExit) as exc_info:
main()
assert exc_info.value.code == 1
class TestVersionFlag:
def test_version_flag(self, capsys):
with patch("sys.argv", ["licence-normaliser", "--version"]):
with pytest.raises(SystemExit) as exc_info:
main()
assert exc_info.value.code == 0
assert "licence-normaliser" in capsys.readouterr().out
src/licence_normaliser/tests/test_core.py
"""End-to-end pipeline tests via the public API."""
from licence_normaliser import normalise_license, normalise_licenses
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
class TestDirectLookup:
def test_mit(self):
v = normalise_license("mit")
assert v.key == "mit"
assert v.family.key == "osi"
def test_apache(self):
v = normalise_license("apache-2.0")
assert v.key == "apache-2.0"
assert v.family.key == "osi"
def test_cc_by_4_0(self):
v = normalise_license("cc-by-4.0")
assert v.key == "cc-by-4.0"
assert v.family.key == "cc"
def test_cc_by_nc_nd_4_0(self):
v = normalise_license("cc-by-nc-nd-4.0")
assert v.key == "cc-by-nc-nd-4.0"
assert v.family.key == "cc"
def test_cc0_1_0(self):
v = normalise_license("cc0-1.0")
assert v.key == "cc0-1.0"
assert v.family.key == "cc0"
def test_gpl_3_0(self):
v = normalise_license("gpl-3.0")
assert v.key == "gpl-3.0"
assert v.family.key == "copyleft"
def test_gpl_2_0_only(self):
v = normalise_license("gpl-2.0-only")
assert v.key == "gpl-2.0-only"
assert v.family.key == "copyleft"
def test_lgpl_2_1(self):
v = normalise_license("lgpl-2.1")
assert v.key == "lgpl-2.1"
assert v.family.key == "copyleft"
def test_agpl_3_0(self):
v = normalise_license("agpl-3.0")
assert v.key == "agpl-3.0"
assert v.family.key == "copyleft"
def test_bsd_3_clause(self):
v = normalise_license("bsd-3-clause")
assert v.key == "bsd-3-clause"
assert v.family.key == "osi"
def test_isc(self):
v = normalise_license("isc")
assert v.key == "isc"
assert v.family.key == "osi"
def test_mpl_2_0(self):
v = normalise_license("mpl-2.0")
assert v.key == "mpl-2.0"
assert v.family.key == "osi"
def test_unlicense(self):
v = normalise_license("unlicense")
assert v.key == "unlicense"
assert v.family.key == "osi"
def test_wtfpl(self):
v = normalise_license("wtfpl")
assert v.key == "wtfpl"
assert v.family.key == "osi"
def test_zlib(self):
v = normalise_license("zlib")
assert v.key == "zlib"
assert v.family.key == "osi"
def test_odbl_1_0(self):
v = normalise_license("odbl-1.0")
assert v.key == "odbl-1.0"
assert v.family.key == "open-data"
def test_pddl_1_0(self):
v = normalise_license("pddl-1.0")
assert v.key == "pddl-1.0"
assert v.family.key == "data"
def test_odc_by_1_0(self):
v = normalise_license("odc-by-1.0")
assert v.key == "odc-by-1.0"
assert v.family.key == "open-data"
def test_unknown(self):
v = normalise_license("unknown")
assert v.key == "unknown"
assert v.family.key == "unknown"
def test_case_insensitive(self):
v = normalise_license("MIT")
assert v.key == "mit"
v = normalise_license("Apache-2.0")
assert v.key == "apache-2.0"
class TestBuiltinAliases:
def test_cc_by(self):
assert normalise_license("CC BY").key == "cc-by"
def test_cc_by_4_0(self):
assert normalise_license("CC BY 4.0").key == "cc-by-4.0"
def test_cc_by_nc_nd_4_0(self):
assert normalise_license("CC BY-NC-ND 4.0").key == "cc-by-nc-nd-4.0"
def test_cc_by_nc_sa_4_0(self):
assert normalise_license("CC BY-NC-SA 4.0").key == "cc-by-nc-sa-4.0"
def test_cc0_1_0(self):
assert normalise_license("CC0 1.0").key == "cc0-1.0"
def test_public_domain(self):
assert normalise_license("public domain").key == "public-domain"
class TestUrlLookup:
def test_cc_by_https(self):
v = normalise_license("https://creativecommons.org/licenses/by/4.0/")
assert v.key == "cc-by-4.0"
def test_cc_by_http(self):
v = normalise_license("http://creativecommons.org/licenses/by/4.0/")
assert v.key == "cc-by-4.0"
def test_cc_by_no_trailing_slash(self):
v = normalise_license("https://creativecommons.org/licenses/by/4.0")
assert v.key == "cc-by-4.0"
def test_mit_url(self):
v = normalise_license("https://opensource.org/licenses/MIT")
assert v.key == "mit"
class TestFamilyInference:
def test_cc_family(self):
v = normalise_license("cc-by-4.0")
assert v.family.key == "cc"
def test_cc0_family(self):
v = normalise_license("cc0-1.0")
assert v.family.key == "cc0"
def test_copyleft_family(self):
assert normalise_license("gpl-3.0").family.key == "copyleft"
assert normalise_license("agpl-3.0").family.key == "copyleft"
assert normalise_license("lgpl-2.1").family.key == "copyleft"
def test_osi_family(self):
assert normalise_license("mit").family.key == "osi"
assert normalise_license("apache-2.0").family.key == "osi"
assert normalise_license("bsd-3-clause").family.key == "osi"
def test_data_family(self):
assert normalise_license("pddl-1.0").family.key == "data"
class TestNameInference:
def test_cc_name_strips_version(self):
assert normalise_license("cc-by-4.0").license.key == "cc-by"
assert normalise_license("cc-by-nc-nd-4.0").license.key == "cc-by-nc-nd"
assert normalise_license("cc-by-sa-3.0").license.key == "cc-by-sa"
assert normalise_license("cc0-1.0").license.key == "cc0"
assert normalise_license("cc-by-nc-sa-4.0").license.key == "cc-by-nc-sa"
def test_non_cc_keeps_key(self):
assert normalise_license("mit").license.key == "mit"
assert normalise_license("gpl-3.0").license.key == "gpl-3"
class TestHierarchyNavigation:
def test_version_license_family_chain(self):
v = normalise_license("CC BY-NC-ND 4.0")
assert v.key == "cc-by-nc-nd-4.0"
assert v.license.key == "cc-by-nc-nd"
assert v.license.family.key == "cc"
assert v.family.key == "cc"
def test_str_representations(self):
v = normalise_license("CC BY-NC-ND 4.0")
assert str(v) == "cc-by-nc-nd-4.0"
assert str(v.license) == "cc-by-nc-nd"
assert str(v.family) == "cc"
class TestFallback:
def test_unknown_string(self):
v = normalise_license("some-totally-unknown-license-xyz")
assert v.key == "some-totally-unknown-license-xyz"
assert v.family.key == "unknown"
def test_empty_string(self):
v = normalise_license("")
assert v.key == "unknown"
def test_whitespace_only(self):
v = normalise_license(" ")
assert v.key == "unknown"
class TestBatchNormalisation:
def test_basic_batch(self):
results = normalise_licenses(["MIT", "Apache-2.0", "CC BY 4.0"])
assert [r.key for r in results] == ["mit", "apache-2.0", "cc-by-4.0"]
def test_batch_preserves_order(self):
raw = ["GPL-3.0", "MIT", "CC BY 4.0", "Apache-2.0"]
expected = ["gpl-3.0", "mit", "cc-by-4.0", "apache-2.0"]
assert [r.key for r in normalise_licenses(raw)] == expected
def test_batch_accepts_generator(self):
results = normalise_licenses(x for x in ["MIT", "ISC"])
assert results[0].key == "mit"
def test_batch_empty(self):
assert normalise_licenses([]) == []
src/licence_normaliser/tests/test_exceptions.py
"""Tests for strict mode and the public exception hierarchy."""
import pytest
from licence_normaliser import normalise_license, normalise_licenses
from licence_normaliser.exceptions import (
LicenseNormaliserError,
LicenseNotFoundError,
)
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
class TestLicenseNotFoundError:
def test_is_subclass_of_base(self):
assert issubclass(LicenseNotFoundError, LicenseNormaliserError)
def test_is_subclass_of_exception(self):
assert issubclass(LicenseNotFoundError, Exception)
def test_attributes(self):
exc = LicenseNotFoundError("My License", "my license")
assert exc.raw == "My License"
assert exc.cleaned == "my license"
def test_str_contains_raw(self):
exc = LicenseNotFoundError("My License", "my license")
assert "My License" in str(exc)
def test_str_mentions_strict_false(self):
exc = LicenseNotFoundError("x", "x")
assert "strict=False" in str(exc)
class TestStrictModeNormalise:
def test_known_license_no_raise(self):
# Known licenses must not raise in strict mode
v = normalise_license("MIT", strict=True)
assert v.key == "mit"
def test_unknown_raises_license_not_found(self):
with pytest.raises(LicenseNotFoundError) as exc_info:
normalise_license("totally-unknown-xyz-9999", strict=True)
assert exc_info.value.raw == "totally-unknown-xyz-9999"
assert exc_info.value.cleaned == "totally-unknown-xyz-9999"
def test_empty_string_raises(self):
with pytest.raises(LicenseNotFoundError):
normalise_license("", strict=True)
def test_whitespace_only_raises(self):
with pytest.raises(LicenseNotFoundError):
normalise_license(" ", strict=True)
def test_cc_url_known_no_raise(self):
v = normalise_license(
"https://creativecommons.org/licenses/by/4.0/", strict=True
)
assert v.key == "cc-by-4.0"
def test_strict_false_unknown_returns_unknown(self):
# Default (strict=False): silently returns unknown
v = normalise_license("no-such-license-xyzzy", strict=False)
assert v.family.key == "unknown"
def test_strict_default_is_false(self):
# Calling without strict kwarg should not raise
v = normalise_license("no-such-license-xyzzy")
assert v.family.key == "unknown"
class TestStrictModeBatch:
def test_all_known_no_raise(self):
results = normalise_licenses(["MIT", "Apache-2.0"], strict=True)
assert len(results) == 2
assert results[0].key == "mit"
assert results[1].key == "apache-2.0"
def test_one_unknown_raises(self):
with pytest.raises(LicenseNotFoundError):
normalise_licenses(["MIT", "no-such-license-xyz"], strict=True)
def test_non_strict_batch_with_unknown(self):
results = normalise_licenses(["MIT", "no-such-license-xyz"], strict=False)
assert results[0].key == "mit"
assert results[1].family.key == "unknown"
def test_empty_batch_strict(self):
# Empty input should not raise even in strict mode
assert normalise_licenses([], strict=True) == []
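The behaviour these tests describe can be summarised in a small, self-contained sketch. The two exception classes below are stand-ins written to satisfy the assertions above (``raw`` and ``cleaned`` attributes, a message mentioning ``strict=False``); the ``resolve`` helper and the exact message wording are assumptions and do not reproduce the package's ``exceptions.py`` or its resolution pipeline.

```python
from __future__ import annotations


class LicenseNormaliserError(Exception):
    """Stand-in base class for all errors in this sketch."""


class LicenseNotFoundError(LicenseNormaliserError):
    """Raised in strict mode when an input cannot be resolved."""

    def __init__(self, raw: str, cleaned: str) -> None:
        self.raw = raw
        self.cleaned = cleaned
        super().__init__(
            f"License not found: {raw!r} (cleaned: {cleaned!r}). "
            "Pass strict=False to fall back to an unknown license."
        )


def resolve(raw: str, known: set[str], strict: bool = False) -> str:
    """Hypothetical resolver: look up a cleaned key, or fall back / raise."""
    cleaned = raw.strip().lower()
    if cleaned in known:
        return cleaned
    if strict:
        raise LicenseNotFoundError(raw, cleaned)
    # Non-strict mode: keep the cleaned string, or "unknown" if it is empty.
    return cleaned or "unknown"
```

Deriving from a shared base class lets callers catch every package error with a single ``except`` clause while still distinguishing the not-found case.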
src/licence_normaliser/tests/test_integration.py
"""Comprehensive integration tests covering the full license matrix.
Each tuple: (input_string, expected_version_key, expected_license_key,
expected_family_key)
"""
import pytest
from licence_normaliser import (
LicenseNormalisationError,
LicenseNotFoundError,
LicenseVersion,
normalise_license,
normalise_licenses,
)
LICENSE_MATRIX = [
# raw,expected_key,expected_license,expected_family
# === OSI-approved licenses ===
("mit", "mit", "mit", "osi"),
("MIT", "mit", "mit", "osi"),
(" mit ", "mit", "mit", "osi"),
("apache-2.0", "apache-2.0", "apache", "osi"),
("Apache-2.0", "apache-2.0", "apache", "osi"),
("Apache 2.0", "apache-2.0", "apache", "osi"),
("Apache License 2.0", "apache-2.0", "apache", "osi"),
(
"BSD 3-Clause",
"bsd-3-clause",
"bsd-3-clause",
"osi",
), # Resolves to bsd-3-clause/osi, matches SPDX and alias entries
("bsd-3-clause", "bsd-3-clause", "bsd-3-clause", "osi"),
("BSD License", "bsd-3-clause", "bsd-3-clause", "osi"),
("MPL-2.0", "mpl-2.0", "mpl", "osi"),
("mpl-2.0", "mpl-2.0", "mpl", "osi"),
(
"Mozilla Public License 2.0",
"mpl-2.0",
"mpl",
"osi",
), # Canonical full name of MPL-2.0, matches alias entry
("ISC", "isc", "isc", "osi"),
("isc", "isc", "isc", "osi"),
("ISC License", "isc", "isc", "osi"),
("Unlicense", "unlicense", "unlicense", "osi"),
("unlicense", "unlicense", "unlicense", "osi"),
("WTFPL", "wtfpl", "wtfpl", "osi"),
("wtfpl", "wtfpl", "wtfpl", "osi"),
("Zlib", "zlib", "zlib", "osi"),
("zlib", "zlib", "zlib", "osi"),
# === GPL / AGPL / LGPL (copyleft) ===
("gpl-3.0", "gpl-3.0", "gpl-3", "copyleft"),
("GPL-3.0", "gpl-3.0", "gpl-3", "copyleft"),
("gpl-3.0+", "gpl-3.0", "gpl-3", "copyleft"),
(
"gpl-3-0",
"gpl-3-0",
"gpl-3-0",
"copyleft",
), # NOTE: hyphen instead of dot; resolver recognises gpl but doesn't normalise
("GNU GPL v3", "gpl-3.0", "gpl-3", "copyleft"),
("GPL v3", "gpl-3.0", "gpl-3", "copyleft"),
("gpl-2.0", "gpl-2.0", "gpl-2", "copyleft"),
("GPL v2", "gpl-2.0", "gpl-2", "copyleft"),
("lgpl-3.0", "lgpl-3.0", "lgpl-3", "copyleft"),
("LGPL-3.0", "lgpl-3.0", "lgpl-3", "copyleft"),
("lgpl-2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"),
("LGPL v2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"),
("lgpl v2.1", "lgpl-2.1", "lgpl-2.1", "copyleft"),
("agpl-3.0", "agpl-3.0", "agpl-3", "copyleft"),
("AGPL v3", "agpl-3.0", "agpl-3", "copyleft"),
# === Creative Commons ===
("CC BY 4.0", "cc-by-4.0", "cc-by", "cc"),
("cc by 4.0", "cc-by-4.0", "cc-by", "cc"),
("cc-by-4.0", "cc-by-4.0", "cc-by", "cc"),
("CC BY 3.0", "cc-by-3.0", "cc-by", "cc"),
("cc by 3.0", "cc-by-3.0", "cc-by", "cc"),
("cc-by-3.0", "cc-by-3.0", "cc-by", "cc"),
("CC BY 2.5", "cc-by-2.5", "cc-by", "cc"),
("CC BY 2.0", "cc-by-2.0", "cc-by", "cc"),
("CC BY 1.0", "cc-by-1.0", "cc-by", "cc"),
("cc by", "cc-by", "cc-by", "cc"),
(
"CC-BY",
"cc-by",
"cc-by",
"cc",
), # SPDX form, resolves to cc-by/cc
("CC BY-NC 4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"),
("cc by-nc 4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"),
("cc-by-nc-4.0", "cc-by-nc-4.0", "cc-by-nc", "cc"),
("CC BY-NC 3.0", "cc-by-nc-3.0", "cc-by-nc", "cc"),
("CC BY-NC-SA 4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"),
("cc by-nc-sa 4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"),
("cc-by-nc-sa-4.0", "cc-by-nc-sa-4.0", "cc-by-nc-sa", "cc"),
("CC BY-NC-SA 3.0", "cc-by-nc-sa-3.0", "cc-by-nc-sa", "cc"),
("CC BY-NC-ND 4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
("cc by-nc-nd 4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
("cc-by-nc-nd-4.0", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
("CC BY-NC-ND 3.0", "cc-by-nc-nd-3.0", "cc-by-nc-nd", "cc"),
("cc by-nc-nd 3.0", "cc-by-nc-nd-3.0", "cc-by-nc-nd", "cc"),
("CC BY-ND 4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"),
("cc by-nd 4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"),
("cc-by-nd-4.0", "cc-by-nd-4.0", "cc-by-nd", "cc"),
("CC BY-SA 4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"),
("cc by-sa 4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"),
("cc-by-sa-4.0", "cc-by-sa-4.0", "cc-by-sa", "cc"),
("CC BY-SA 3.0", "cc-by-sa-3.0", "cc-by-sa", "cc"),
("cc-by-3.0-igo", "cc-by-3.0-igo", "cc-by", "cc"),
("cc-by-nc-nd-3.0-igo", "cc-by-nc-nd-3.0-igo", "cc-by-nc-nd", "cc"),
# CC0
("CC0 1.0", "cc0-1.0", "cc0", "cc0"),
("cc0 1.0", "cc0-1.0", "cc0", "cc0"),
("cc0-1.0", "cc0-1.0", "cc0", "cc0"),
("CC0", "cc0-1.0", "cc0", "cc0"),
("cc0", "cc0-1.0", "cc0", "cc0"),
("cc-zero", "cc0-1.0", "cc0", "cc0"),
("CC Zero", "cc0-1.0", "cc0", "cc0"),
("CC-Zero", "cc0-1.0", "cc0", "cc0"),
("creative commons zero", "cc0-1.0", "cc0", "cc0"),
("Creative Commons Zero 1.0", "cc0-1.0", "cc0", "cc0"),
# CC-PDM
("cc-pdm", "cc-pdm-1.0", "cc-pdm", "public-domain"),
("CC-PDM", "cc-pdm-1.0", "cc-pdm", "public-domain"),
("cc-pdm-1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"),
("CC-PDM 1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"),
("cc-pdm 1.0", "cc-pdm-1.0", "cc-pdm", "public-domain"),
("creative commons public domain", "cc-pdm-1.0", "cc-pdm", "public-domain"),
# CC shorthand
("creative commons by", "cc-by", "cc-by", "cc"),
("creative commons by 4.0", "cc-by-4.0", "cc-by", "cc"),
(
"creative commons by-sa",
"cc-by-sa",
"cc-by-sa",
"cc",
), # Specifies by-sa, license must be cc-by-sa
(
"creative commons by-nc",
"cc-by-nc",
"cc-by-nc",
"cc",
), # Specifies by-nc, license must be cc-by-nc
(
"creative commons by-nc-sa",
"cc-by-nc-sa",
"cc-by-nc-sa",
"cc",
), # Specifies by-nc-sa, license must be cc-by-nc-sa
(
"creative commons by-nc-nd",
"cc-by-nc-nd",
"cc-by-nc-nd",
"cc",
), # Specifies by-nc-nd, license must be cc-by-nc-nd
(
"creative commons by-nd",
"cc-by-nd",
"cc-by-nd",
"cc",
), # Specifies by-nd, license must be cc-by-nd
# CC URLs
(
"http://creativecommons.org/licenses/by-nc-nd/4.0/",
"cc-by-nc-nd-4.0",
"cc-by-nc-nd",
"cc",
),
("https://creativecommons.org/licenses/by/4.0/", "cc-by-4.0", "cc-by", "cc"),
("http://creativecommons.org/licenses/by/4.0/", "cc-by-4.0", "cc-by", "cc"),
(
"https://creativecommons.org/licenses/by-nc/4.0/",
"cc-by-nc-4.0",
"cc-by-nc",
"cc",
),
(
"https://creativecommons.org/licenses/by-nc-sa/4.0/",
"cc-by-nc-sa-4.0",
"cc-by-nc-sa",
"cc",
),
(
"https://creativecommons.org/licenses/by-nd/4.0/",
"cc-by-nd-4.0",
"cc-by-nd",
"cc",
),
(
"https://creativecommons.org/licenses/by-sa/4.0/",
"cc-by-sa-4.0",
"cc-by-sa",
"cc",
),
(
"http://creativecommons.org/licenses/by-nc-nd/3.0/igo/",
"cc-by-nc-nd-3.0-igo",
"cc-by-nc-nd",
"cc",
),
(
"https://creativecommons.org/licenses/by/3.0/igo/",
"cc-by-3.0-igo",
"cc-by",
"cc",
),
("https://creativecommons.org/publicdomain/zero/1.0/", "cc0-1.0", "cc0", "cc0"),
("http://creativecommons.org/publicdomain/zero/1.0/", "cc0-1.0", "cc0", "cc0"),
# CC prose
("licensed under cc by-nc-nd 4.0 terms", "cc-by-nc-nd-4.0", "cc-by-nc-nd", "cc"),
(
"content is licensed under creative commons by-nc-sa",
"cc-by-nc-sa",
"cc-by-nc-sa", # Contains by-nc-sa, license must be cc-by-nc-sa
"cc",
),
("this content is under creative commons by license", "cc-by", "cc-by", "cc"),
# Open Data
("ODbL", "odbl", "odbl", "open-data"),
("odbl", "odbl", "odbl", "open-data"),
("Open Database License", "odbl", "odbl", "open-data"),
("ODC-BY", "odc-by", "odc-by", "open-data"),
("odc-by", "odc-by", "odc-by", "open-data"),
("PDDL", "pddl", "pddl", "open-data"),
("pddl", "pddl", "pddl", "open-data"),
(
"Open Data Commons Public Domain Dedication",
"public-domain",
"public-domain",
"public-domain",
),
# Publisher
("elsevier-oa", "elsevier-oa", "elsevier-oa", "publisher-oa"),
(
"Elsevier OA",
"elsevier-oa",
"elsevier-oa",
"publisher-oa",
), # "Elsevier OA" unambiguously identifies Elsevier OA license
("elsevier tdm", "elsevier-tdm", "elsevier-tdm", "publisher-tdm"),
("Elsevier TDM", "elsevier-tdm", "elsevier-tdm", "publisher-tdm"),
("Elsevier User License", "elsevier-oa", "elsevier-oa", "publisher-oa"),
(
"https://www.elsevier.com/open-access/userlicense/1.0/",
"elsevier-oa",
"elsevier-oa",
"publisher-oa",
),
("wiley-tdm", "wiley-tdm", "wiley-tdm", "publisher-tdm"),
("Wiley TDM", "wiley-tdm", "wiley-tdm", "publisher-tdm"),
("wiley vor", "wiley-vor", "wiley-vor", "publisher-proprietary"),
("springer-tdm", "springer-tdm", "springer-tdm", "publisher-tdm"),
(
"Springer Nature TDM",
"springernature-tdm",
"springernature-tdm",
"publisher-tdm",
),
("acs-authorchoice", "acs-authorchoice", "acs-authorchoice", "publisher-oa"),
("ACS AuthorChoice", "acs-authorchoice", "acs-authorchoice", "publisher-oa"),
(
"acs-authorchoice-ccby",
"acs-authorchoice-ccby",
"acs-authorchoice-ccby",
"publisher-oa",
),
(
"acs authorchoice cc by",
"acs-authorchoice-ccby",
"acs-authorchoice-ccby",
"publisher-oa",
),
("aps-default", "aps-default", "aps-default", "publisher-proprietary"),
("APS Default", "aps-default", "aps-default", "publisher-proprietary"),
("iop-tdm", "iop-tdm", "iop-tdm", "publisher-tdm"),
("iop copyright", "iop-copyright", "iop-copyright", "publisher-proprietary"),
("bmj copyright", "bmj-copyright", "bmj-copyright", "publisher-proprietary"),
("rsc terms", "rsc-terms", "rsc-terms", "publisher-proprietary"),
("cup terms", "cup-terms", "cup-terms", "publisher-proprietary"),
("degruyter terms", "degruyter-terms", "degruyter-terms", "publisher-proprietary"),
("tandf terms", "tandf-terms", "tandf-terms", "publisher-proprietary"),
(
"sage permissions",
"sage-permissions",
"sage-permissions",
"publisher-proprietary",
),
("wiley terms", "wiley-terms", "wiley-terms", "publisher-proprietary"),
("wiley am", "wiley-am", "wiley-am", "publisher-proprietary"),
("pnas licenses", "pnas-licenses", "pnas-licenses", "publisher-proprietary"),
(
"aaas author reuse",
"aaas-author-reuse",
"aaas-author-reuse",
"publisher-proprietary",
),
("aip rights", "aip-rights", "aip-rights", "publisher-proprietary"),
("jama cc by", "jama-cc-by", "jama-cc-by", "publisher-oa"),
("thieme nlm", "thieme-nlm", "thieme-nlm", "publisher-oa"),
("oup chorus", "oup-chorus", "oup-chorus", "publisher-oa"),
("implied oa", "implied-oa", "implied-oa", "publisher-oa"),
("implied open access", "implied-oa", "implied-oa", "publisher-oa"),
("unspecified oa", "unspecified-oa", "unspecified-oa", "other-oa"),
(
"publisher specific oa",
"publisher-specific-oa",
"publisher-specific-oa",
"publisher-oa",
),
("author manuscript", "author-manuscript", "author-manuscript", "publisher-oa"),
("open access", "other-oa", "other-oa", "other-oa"),
("other-oa", "other-oa", "other-oa", "other-oa"),
(
"all rights reserved",
"all-rights-reserved",
"all-rights-reserved",
"publisher-proprietary",
),
("no reuse", "no-reuse", "no-reuse", "publisher-proprietary"),
# Publisher prose
(
"this article is licensed under elsevier tdm agreement",
"elsevier-tdm",
"elsevier-tdm",
"publisher-tdm",
),
(
"journal article under elsevier user license for open access",
"elsevier-oa",
"elsevier-oa",
"publisher-oa",
),
(
"acs authorchoice option was selected by the authors",
"acs-authorchoice",
"acs-authorchoice",
"publisher-oa",
),
(
"springer tdm policy applies to this content",
"springer-tdm",
"springer-tdm",
"publisher-tdm",
),
# Unknown
(
"Totally Fake License XYZ999",
"totally fake license xyz999",
"totally fake license xyz999",
"unknown",
),
# Public domain
("public domain", "public-domain", "public-domain", "public-domain"),
("public-domain", "public-domain", "public-domain", "public-domain"),
("pd", "public-domain", "public-domain", "public-domain"),
]
@pytest.mark.parametrize(
"raw,expected_key,expected_license,expected_family", LICENSE_MATRIX
)
def test_license_matrix(raw, expected_key, expected_license, expected_family):
v = normalise_license(raw)
assert v.key == expected_key, f"input: {raw!r} key: {v.key!r} != {expected_key!r}"
assert v.license.key == expected_license, (
f"input: {raw!r} license: {v.license.key!r} != {expected_license!r}"
)
assert v.family.key == expected_family, (
f"input: {raw!r} family: {v.family.key!r} != {expected_family!r}"
)
def test_strict_mode_unknown_raises():
with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
normalise_license("xyzzy unknown license 123", strict=True)
def test_strict_mode_known_does_not_raise():
v = normalise_license("mit", strict=True)
assert v.key == "mit"
def test_empty_string_returns_unknown():
v = normalise_license("")
assert v.key == "unknown"
assert v.family.key == "unknown"
def test_whitespace_only_returns_unknown():
v = normalise_license(" \n\t ")
assert v.key == "unknown"
def test_batch_normalise_preserves_order():
inputs = ["MIT", "Apache-2.0", "CC BY 4.0", "unknown garbage"]
results = normalise_licenses(inputs)
assert [r.key for r in results] == [
"mit",
"apache-2.0",
"cc-by-4.0",
"unknown garbage",
]
def test_normalise_mit():
v = normalise_license("MIT")
assert isinstance(v, LicenseVersion)
assert v.key == "mit"
assert str(v) == "mit"
assert str(v.license) == "mit"
def test_normalise_cc():
v = normalise_license("CC BY 4.0")
assert v.key == "cc-by-4.0"
assert str(v.license) == "cc-by"
assert str(v.family) == "cc"
def test_batch():
results = normalise_licenses(["MIT", "Apache-2.0"])
assert len(results) == 2
assert results[0].key == "mit"
assert results[1].key == "apache-2.0"
def test_strict_mode_raises():
with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
normalise_license("Totally Fake License XYZ999", strict=True)
def test_strict_batch_raises():
with pytest.raises((LicenseNormalisationError, LicenseNotFoundError)):
normalise_licenses(["MIT", "Fake License XYZ999"], strict=True)
def test_empty_input():
v = normalise_license("")
assert v.key == "unknown"
v = normalise_license(" ")
assert v.key == "unknown"
def test_real_world_license_strings():
"""Test against real-world license strings collected from the wild."""
cases = [
("http://creativecommons.org/licenses/by-nc-nd/4.0/", "cc-by-nc-nd-4.0"),
("http://creativecommons.org/licenses/by/4.0/", "cc-by-4.0"),
("http://creativecommons.org/licenses/by-nc/4.0/", "cc-by-nc-4.0"),
(
"http://www.elsevier.com/open-access/userlicense/1.0/",
"elsevier-oa",
),
(
"http://creativecommons.org/licenses/by-nc-nd/3.0/igo/",
"cc-by-nc-nd-3.0-igo",
),
("CC BY-NC-ND 4.0", "cc-by-nc-nd-4.0"),
(
"http://creativecommons.org/licenses/by/3.0/igo/",
"cc-by-3.0-igo",
),
]
for raw, expected_key in cases:
v = normalise_license(raw)
assert v.key == expected_key, (
f"input: {raw!r} -> got {v.key!r}, want {expected_key!r}"
)
src/licence_normaliser/tests/test_models.py
src/licence_normaliser/tests/test_models.py
"""Unit tests for _models.py."""
import pytest
from licence_normaliser._models import LicenseFamily, LicenseName, LicenseVersion
__author__ = "Artur Barseghyan <artur.barseghyan@gmail.com>"
__copyright__ = "2026 Artur Barseghyan"
__license__ = "MIT"
def _cc_fam():
return LicenseFamily(key="cc")
def _osi_fam():
return LicenseFamily(key="osi")
def _cc_by_name():
return LicenseName(key="cc-by", family=_cc_fam())
def _mit_version():
return LicenseVersion(
key="mit",
url="https://opensource.org/licenses/MIT",
license=LicenseName(key="mit", family=_osi_fam()),
)
class TestLicenseFamily:
def test_str(self):
assert str(LicenseFamily(key="cc")) == "cc"
def test_repr(self):
assert repr(LicenseFamily(key="osi")) == "LicenseFamily('osi')"
def test_eq_same_type(self):
assert LicenseFamily(key="cc") == LicenseFamily(key="cc")
def test_eq_str(self):
assert LicenseFamily(key="cc") == "cc"
def test_neq(self):
assert LicenseFamily(key="cc") != LicenseFamily(key="osi")
def test_hash_usable_in_set(self):
s = {LicenseFamily(key="cc"), LicenseFamily(key="cc"), LicenseFamily(key="osi")}
assert len(s) == 2
def test_frozen_prevents_mutation(self):
fam = LicenseFamily(key="cc")
with pytest.raises((AttributeError, TypeError)):
fam.key = "other" # type: ignore
class TestLicenseName:
def test_str(self):
assert str(_cc_by_name()) == "cc-by"
def test_frozen_prevents_mutation(self):
name = _cc_by_name()
with pytest.raises((AttributeError, TypeError)):
name.key = "other" # type: ignore
def test_family_reference(self):
assert _cc_by_name().family.key == "cc"
class TestLicenseVersion:
def test_str(self):
assert str(_mit_version()) == "mit"
def test_family_shortcut(self):
assert _mit_version().family.key == "osi"
def test_frozen_prevents_mutation(self):
v = _mit_version()
with pytest.raises((AttributeError, TypeError)):
v.key = "other" # type: ignore
def test_url_stored(self):
assert _mit_version().url == "https://opensource.org/licenses/MIT"
def test_url_none(self):
v = LicenseVersion(
key="unknown",
url=None,
license=LicenseName(key="unknown", family=LicenseFamily(key="unknown")),
)
assert v.url is None
src/licence_normaliser/tests/test_prose.py
src/licence_normaliser/tests/test_prose.py
"""Tests for prose pattern matching via ProseParser."""
from licence_normaliser import normalise_license
class TestProsePatternMatching:
def test_cc_by_nc_nd_4_0_prose(self):
v = normalise_license("this work is licensed under cc by-nc-nd 4.0 terms")
assert v.key == "cc-by-nc-nd-4.0"
assert v.family.key == "cc"
def test_cc_by_nc_nd_3_0_prose(self):
v = normalise_license("license: cc by-nc-nd 3.0")
assert v.key == "cc-by-nc-nd-3.0"
assert v.family.key == "cc"
def test_cc_by_nc_sa_creative_commons_prose(self):
v = normalise_license("content licensed under creative commons by-nc-sa")
assert v.key == "cc-by-nc-sa"
assert v.family.key == "cc"
def test_attribution_prose(self):
v = normalise_license(
"this content is made available under creative commons by license"
)
assert v.key == "cc-by"
assert v.family.key == "cc"
def test_attribution_noncommercial_prose(self):
v = normalise_license(
"this article is licensed under attribution noncommercial terms"
)
assert v.key == "cc-by-nc"
assert v.family.key == "cc"
def test_attribution_sharealike_prose(self):
v = normalise_license("licensed under attribution share alike conditions")
assert v.key == "cc-by-sa"
assert v.family.key == "cc"
def test_elsevier_tdm_prose(self):
v = normalise_license(
"this journal participates in text and data mining as "
"permitted by the elsevier tdm agreement"
)
assert v.key == "elsevier-tdm"
assert v.family.key == "publisher-tdm"
def test_elsevier_user_license_prose(self):
v = normalise_license(
"elsevier user license applies to this open access article"
)
assert v.key == "elsevier-oa"
assert v.family.key == "publisher-oa"
def test_acs_authorchoice_prose(self):
v = normalise_license("acs authorchoice option was selected by the authors")
assert v.key == "acs-authorchoice"
assert v.family.key == "publisher-oa"
def test_all_rights_reserved_prose(self):
v = normalise_license("all rights reserved except as permitted by law")
assert v.key == "all-rights-reserved"
assert v.family.key == "publisher-proprietary"
def test_short_string_via_registry(self):
v = normalise_license("cc by-nc-nd")
assert v.key == "cc-by-nc-nd"
assert v.family.key == "cc"
def test_open_access_prose_matched(self):
v = normalise_license("open access article available now")
assert v.key == "other-oa"
assert v.family.key == "other-oa"
src/licence_normaliser/tests/test_publisher.py
src/licence_normaliser/tests/test_publisher.py
"""Tests for PublisherParser - publisher URLs and shorthand aliases."""
from licence_normaliser import normalise_license
class TestPublisherUrls:
def test_elsevier_oa_url(self):
v = normalise_license("https://www.elsevier.com/open-access/userlicense/1.0/")
assert v.key == "elsevier-oa"
assert v.family.key == "publisher-oa"
def test_elsevier_oa_url_http(self):
v = normalise_license("http://www.elsevier.com/open-access/userlicense/1.0/")
assert v.key == "elsevier-oa"
assert v.family.key == "publisher-oa"
def test_elsevier_tdm_url(self):
v = normalise_license("https://www.elsevier.com/tdm/userlicense/1.0/")
assert v.key == "elsevier-tdm"
assert v.family.key == "publisher-tdm"
def test_wiley_tdm_url(self):
v = normalise_license("http://doi.wiley.com/10.1002/tdm_license_1")
assert v.key == "wiley-tdm"
assert v.family.key == "publisher-tdm"
def test_wiley_terms_url(self):
v = normalise_license("https://onlinelibrary.wiley.com/terms-and-conditions")
assert v.key == "wiley-terms"
assert v.family.key == "publisher-proprietary"
def test_springer_tdm_url(self):
v = normalise_license("https://www.springer.com/tdm")
assert v.key == "springer-tdm"
assert v.family.key == "publisher-tdm"
def test_springernature_tdm_url(self):
v = normalise_license(
"https://www.springernature.com/gp/researchers/text-and-data-mining"
)
assert v.key == "springernature-tdm"
assert v.family.key == "publisher-tdm"
def test_acs_authorchoice_ccby_url(self):
v = normalise_license(
"https://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html"
)
assert v.key == "acs-authorchoice-ccby"
assert v.family.key == "publisher-oa"
def test_acs_authorchoice_url(self):
v = normalise_license(
"https://pubs.acs.org/page/policy/authorchoice_termsofuse.html"
)
assert v.key == "acs-authorchoice"
assert v.family.key == "publisher-oa"
def test_acs_authorchoice_nih_url(self):
v = normalise_license(
"https://pubs.acs.org/page/policy/"
"acs_authorchoice_with_nih_addendum_termsofuse.html"
)
assert v.key == "acs-authorchoice-nih"
assert v.family.key == "publisher-oa"
def test_rsc_terms_url(self):
v = normalise_license(
"https://www.rsc.org/journals-books-databases/"
"journal-authors-reviewers/licences-copyright-permissions/"
)
assert v.key == "rsc-terms"
assert v.family.key == "publisher-proprietary"
def test_iop_tdm_url(self):
v = normalise_license(
"https://iopscience.iop.org/info/page/text-and-data-mining"
)
assert v.key == "iop-tdm"
assert v.family.key == "publisher-tdm"
def test_bmj_copyright_url(self):
v = normalise_license(
"https://www.bmj.com/company/legal-stuff/copyright-notice/"
)
assert v.key == "bmj-copyright"
assert v.family.key == "publisher-proprietary"
def test_aaas_author_reuse_url(self):
v = normalise_license(
"https://www.science.org/content/page/science-licenses-journal-article-reuse"
)
assert v.key == "aaas-author-reuse"
assert v.family.key == "publisher-proprietary"
def test_aps_default_url(self):
v = normalise_license("https://link.aps.org/licenses/aps-default-license")
assert v.key == "aps-default"
assert v.family.key == "publisher-proprietary"
def test_aps_tdm_url(self):
v = normalise_license(
"https://link.aps.org/licenses/aps-default-text-mining-license"
)
assert v.key == "aps-tdm"
assert v.family.key == "publisher-tdm"
def test_cup_terms_url(self):
v = normalise_license("https://www.cambridge.org/core/terms")
assert v.key == "cup-terms"
assert v.family.key == "publisher-proprietary"
def test_aip_rights_url(self):
v = normalise_license(
"https://publishing.aip.org/authors/rights-and-permissions"
)
assert v.key == "aip-rights"
assert v.family.key == "publisher-proprietary"
def test_jama_cc_by_url(self):
v = normalise_license("https://jamanetwork.com/pages/cc-by-license-permissions")
assert v.key == "jama-cc-by"
assert v.family.key == "publisher-oa"
def test_oup_chorus_url(self):
v = normalise_license(
"https://academic.oup.com/journals/pages/open_access/"
"funder_policies/chorus/standard_publication_model"
)
assert v.key == "oup-chorus"
assert v.family.key == "publisher-oa"
def test_oup_terms_url(self):
v = normalise_license(
"https://academic.oup.com/pages/standard-publication-reuse-rights"
)
assert v.key == "oup-terms"
assert v.family.key == "publisher-proprietary"
def test_sage_permissions_url(self):
v = normalise_license("https://us.sagepub.com/en-us/nam/journals-permissions")
assert v.key == "sage-permissions"
assert v.family.key == "publisher-proprietary"
def test_tandf_terms_url(self):
v = normalise_license("https://www.tandfonline.com/action/showCopyRight")
assert v.key == "tandf-terms"
assert v.family.key == "publisher-proprietary"
def test_gnu_gpl_url(self):
v = normalise_license("https://www.gnu.org/licenses/gpl-3.0.html")
assert v.key == "gpl-3.0"
assert v.family.key == "copyleft"
class TestPublisherShorthand:
def test_elsevier_user_license(self):
v = normalise_license("elsevier user license")
assert v.key == "elsevier-oa"
assert v.family.key == "publisher-oa"
def test_elsevier_tdm_shorthand(self):
v = normalise_license("elsevier tdm")
assert v.key == "elsevier-tdm"
assert v.family.key == "publisher-tdm"
def test_wiley_tdm_shorthand(self):
v = normalise_license("wiley tdm license")
assert v.key == "wiley-tdm"
assert v.family.key == "publisher-tdm"
def test_wiley_vor(self):
v = normalise_license("wiley vor")
assert v.key == "wiley-vor"
assert v.family.key == "publisher-proprietary"
def test_wiley_am(self):
v = normalise_license("wiley am")
assert v.key == "wiley-am"
assert v.family.key == "publisher-proprietary"
def test_springer_tdm_shorthand(self):
v = normalise_license("springer tdm")
assert v.key == "springer-tdm"
assert v.family.key == "publisher-tdm"
def test_springer_nature_tdm_shorthand(self):
v = normalise_license("springer nature tdm")
assert v.key == "springernature-tdm"
assert v.family.key == "publisher-tdm"
def test_acs_authorchoice_shorthand(self):
v = normalise_license("acs authorchoice")
assert v.key == "acs-authorchoice"
assert v.family.key == "publisher-oa"
def test_acs_authorchoice_ccby_shorthand(self):
v = normalise_license("acs authorchoice cc by")
assert v.key == "acs-authorchoice-ccby"
assert v.family.key == "publisher-oa"
def test_acs_authorchoice_nih_shorthand(self):
v = normalise_license("acs authorchoice nih")
assert v.key == "acs-authorchoice-nih"
assert v.family.key == "publisher-oa"
def test_rsc_terms_shorthand(self):
v = normalise_license("rsc terms")
assert v.key == "rsc-terms"
assert v.family.key == "publisher-proprietary"
def test_iop_tdm_shorthand(self):
v = normalise_license("iop tdm")
assert v.key == "iop-tdm"
assert v.family.key == "publisher-tdm"
def test_iop_copyright_shorthand(self):
v = normalise_license("iop copyright")
assert v.key == "iop-copyright"
assert v.family.key == "publisher-proprietary"
def test_bmj_copyright_shorthand(self):
v = normalise_license("bmj copyright")
assert v.key == "bmj-copyright"
assert v.family.key == "publisher-proprietary"
def test_aaas_author_reuse_shorthand(self):
v = normalise_license("aaas author reuse")
assert v.key == "aaas-author-reuse"
assert v.family.key == "publisher-proprietary"
def test_pnas_licenses_shorthand(self):
v = normalise_license("pnas licenses")
assert v.key == "pnas-licenses"
assert v.family.key == "publisher-proprietary"
def test_aps_default_shorthand(self):
v = normalise_license("aps default")
assert v.key == "aps-default"
assert v.family.key == "publisher-proprietary"
def test_aps_tdm_shorthand(self):
v = normalise_license("aps tdm")
assert v.key == "aps-tdm"
assert v.family.key == "publisher-tdm"
def test_cup_terms_shorthand(self):
v = normalise_license("cup terms")
assert v.key == "cup-terms"
assert v.family.key == "publisher-proprietary"
def test_aip_rights_shorthand(self):
v = normalise_license("aip rights")
assert v.key == "aip-rights"
assert v.family.key == "publisher-proprietary"
def test_jama_cc_by_shorthand(self):
v = normalise_license("jama cc by")
assert v.key == "jama-cc-by"
assert v.family.key == "publisher-oa"
def test_degruyter_terms_shorthand(self):
v = normalise_license("degruyter terms")
assert v.key == "degruyter-terms"
assert v.family.key == "publisher-proprietary"
def test_oup_chorus_shorthand(self):
v = normalise_license("oup chorus")
assert v.key == "oup-chorus"
assert v.family.key == "publisher-oa"
def test_oup_terms_shorthand(self):
v = normalise_license("oup terms")
assert v.key == "oup-terms"
assert v.family.key == "publisher-proprietary"
def test_sage_permissions_shorthand(self):
v = normalise_license("sage permissions")
assert v.key == "sage-permissions"
assert v.family.key == "publisher-proprietary"
def test_tandf_terms_shorthand(self):
v = normalise_license("tandf terms")
assert v.key == "tandf-terms"
assert v.family.key == "publisher-proprietary"
def test_thieme_nlm_shorthand(self):
v = normalise_license("thieme nlm")
assert v.key == "thieme-nlm"
assert v.family.key == "publisher-oa"
class TestPublisherDirectKeys:
def test_elsevier_tdm_key(self):
v = normalise_license("elsevier-tdm")
assert v.key == "elsevier-tdm"
assert v.family.key == "publisher-tdm"
def test_elsevier_oa_key(self):
v = normalise_license("elsevier-oa")
assert v.key == "elsevier-oa"
assert v.family.key == "publisher-oa"
def test_wiley_tdm_key(self):
v = normalise_license("wiley-tdm")
assert v.key == "wiley-tdm"
assert v.family.key == "publisher-tdm"
def test_acs_authorchoice_key(self):
v = normalise_license("acs-authorchoice")
assert v.key == "acs-authorchoice"
assert v.family.key == "publisher-oa"
def test_acs_authorchoice_ccby_key(self):
v = normalise_license("acs-authorchoice-ccby")
assert v.key == "acs-authorchoice-ccby"
assert v.family.key == "publisher-oa"
def test_acs_authorchoice_nih_key(self):
v = normalise_license("acs-authorchoice-nih")
assert v.key == "acs-authorchoice-nih"
assert v.family.key == "publisher-oa"
def test_iop_tdm_key(self):
v = normalise_license("iop-tdm")
assert v.key == "iop-tdm"
assert v.family.key == "publisher-tdm"
def test_aps_tdm_key(self):
v = normalise_license("aps-tdm")
assert v.key == "aps-tdm"
assert v.family.key == "publisher-tdm"
def test_oup_chorus_key(self):
v = normalise_license("oup-chorus")
assert v.key == "oup-chorus"
assert v.family.key == "publisher-oa"
def test_jama_cc_by_key(self):
v = normalise_license("jama-cc-by")
assert v.key == "jama-cc-by"
assert v.family.key == "publisher-oa"
def test_thieme_nlm_key(self):
v = normalise_license("thieme-nlm")
assert v.key == "thieme-nlm"
assert v.family.key == "publisher-oa"
def test_implied_oa_key(self):
v = normalise_license("implied-oa")
assert v.key == "implied-oa"
assert v.family.key == "publisher-oa"
def test_unspecified_oa_key(self):
v = normalise_license("unspecified-oa")
assert v.key == "unspecified-oa"
assert v.family.key == "other-oa"
def test_author_manuscript_key(self):
v = normalise_license("author-manuscript")
assert v.key == "author-manuscript"
assert v.family.key == "publisher-oa"
def test_all_rights_reserved_key(self):
v = normalise_license("all-rights-reserved")
assert v.key == "all-rights-reserved"
assert v.family.key == "publisher-proprietary"
def test_no_reuse_key(self):
v = normalise_license("no-reuse")
assert v.key == "no-reuse"
assert v.family.key == "publisher-proprietary"
def test_other_oa_key(self):
v = normalise_license("other-oa")
assert v.key == "other-oa"
assert v.family.key == "other-oa"
def test_public_domain_key(self):
v = normalise_license("public-domain")
assert v.key == "public-domain"
assert v.family.key == "public-domain"
def test_open_access_key(self):
v = normalise_license("open-access")
assert v.key == "other-oa"
assert v.family.key == "other-oa"
class TestPublisherCatchAll:
def test_implied_oa_shorthand(self):
v = normalise_license("implied oa")
assert v.key == "implied-oa"
assert v.family.key == "publisher-oa"
def test_unspecified_oa_shorthand(self):
v = normalise_license("unspecified oa")
assert v.key == "unspecified-oa"
assert v.family.key == "other-oa"
def test_open_access_shorthand(self):
v = normalise_license("open access")
assert v.key == "other-oa"
assert v.family.key == "other-oa"
def test_author_manuscript_shorthand(self):
v = normalise_license("author manuscript")
assert v.key == "author-manuscript"
assert v.family.key == "publisher-oa"
def test_all_rights_reserved_shorthand(self):
v = normalise_license("all rights reserved")
assert v.key == "all-rights-reserved"
assert v.family.key == "publisher-proprietary"
def test_no_reuse_shorthand(self):
v = normalise_license("no reuse")
assert v.key == "no-reuse"
assert v.family.key == "publisher-proprietary"
class TestCCPublicDomain:
def test_cc_pdm_bare_key(self):
v = normalise_license("cc-pdm")
assert v.key == "cc-pdm-1.0"
assert v.family.key == "public-domain"
def test_cc_pdm_versioned_key(self):
v = normalise_license("cc-pdm-1.0")
assert v.key == "cc-pdm-1.0"
assert v.family.key == "public-domain"
def test_cc0_bare_key(self):
v = normalise_license("cc0")
assert v.key == "cc0-1.0"
assert v.family.key == "cc0"
def test_cc0_versioned_key(self):
v = normalise_license("cc0-1.0")
assert v.key == "cc0-1.0"
assert v.family.key == "cc0"
def test_cc_zero_shorthand(self):
v = normalise_license("cc-zero")
assert v.key == "cc0-1.0"
assert v.family.key == "cc0"
def test_public_domain_fallback(self):
v = normalise_license("public-domain")
assert v.key == "public-domain"
assert v.family.key == "public-domain"
def test_creative_commons_zero(self):
v = normalise_license("creative commons zero")
assert v.key == "cc0-1.0"
assert v.family.key == "cc0"
def test_creative_commons_public_domain(self):
v = normalise_license("creative commons public domain")
assert v.key == "cc-pdm-1.0"
assert v.family.key == "public-domain"