UN GA Resolution 68/262 (adopted 100–11) places Crimea under Ukrainian sovereignty. The maps, training data, and language models do not.
34.1M documents in C4 scanned; 892K with Russian framing found. 16 LLMs from 8 labs audited: all flagships answer correctly when asked directly, and all default to Russian framing when generating freely. One geodata file propagates to 65.7M weekly downloads. Every number is reproducible from the repository.
Before 2014, international academia used the Ukrainian constitutional designation — "Autonomous Republic of Crimea." After annexation, Russia created a new designation — "Republic of Crimea" — by erasing the word "Autonomous." Within 12 months, the Russian designation dominated 82% of new academic papers. No DOI, Scopus, or Web of Science system flags the difference.
"Autonomous" is not a stylistic difference. It is Ukraine's constitutional designation, recognized in UN GA Resolution 68/262.
| Year | Papers | % RU |
|---|---|---|
| 2010–13 | 47 | 13% |
| 2014 | 34 | 47% |
| 2015 | 91 | 82% |
| 2016 | 70 | 87% |
| 2017 | 65 | 89% |
| 2018 | 114 | 84% |
| 2019 | 114 | 83% |
| 2020 | 143 | 88% |
| 2021 | 177 | 92% |
| 2022 | 125 | 86% |
| 2023 | 124 | 84% |
| 2024 | 105 | 91% |
| 2025 | 99 | 89% |
Data: 91,670 papers from OpenAlex. The % RU column counts only instances of "Republic of Crimea" in which the word "Autonomous" is absent, isolating the Russian-only designation.
Natural Earth — the foundational open-source geographic dataset — classifies Crimea's sovereignty as Russia. What Natural Earth does NOT do is read its own adjacent fields in the same row: ISO 3166-2, FIPS, GeoNames, and Yahoo Where-on-Earth all say Ukraine. The contradiction is internal.
The issue has been raised publicly for over a decade. The contribution of this audit is not discovery — it is measurement of the scale and documentation of the chain.
Same pattern for Sevastopol: 7 RU-fields + 7 UA-fields in the same row (iso_3166_2='UA-40', woe_label='Sevastopol City Municipality, UA, Ukraine'). Natural Earth has the correct information in adjacent fields of its own record. Most downstream libraries read the first 7 fields and ignore the last 7.
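The internal contradiction in a single row can be checked mechanically. The sketch below is a minimal illustration, not the audit's own code: the `iso_3166_2` and `woe_label` values are the ones quoted above, while the `sovereignty` key, the `NAME_TO_ISO` table, and both helper names are assumptions made for the example.

```python
# Country names mapped to ISO 3166-1 alpha-2 codes (only the two needed here).
NAME_TO_ISO = {"Russia": "RU", "Ukraine": "UA"}


def country_votes(record: dict) -> dict:
    """Return the country each field of one row points to, as ISO alpha-2."""
    votes = {}
    if record.get("sovereignty"):
        votes["sovereignty"] = NAME_TO_ISO.get(record["sovereignty"], "??")
    if record.get("iso_3166_2"):
        # 'UA-40' -> 'UA'
        votes["iso_3166_2"] = record["iso_3166_2"].split("-")[0]
    if record.get("woe_label"):
        # 'Sevastopol City Municipality, UA, Ukraine' -> 'Ukraine' -> 'UA'
        country_name = record["woe_label"].split(",")[-1].strip()
        votes["woe_label"] = NAME_TO_ISO.get(country_name, "??")
    return votes


def is_self_contradictory(record: dict) -> bool:
    """True when the fields of one row name more than one country."""
    return len(set(country_votes(record).values())) > 1


sevastopol = {
    "sovereignty": "Russia",  # hypothetical key standing in for the RU-side fields
    "iso_3166_2": "UA-40",
    "woe_label": "Sevastopol City Municipality, UA, Ukraine",
}
```

A downstream library that ran a check of this kind before trusting the sovereignty field would at least surface the disagreement instead of silently inheriting it.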
admin_0.SOVEREIGNT = 'Russia'

Structural lesson: the propagation chain is not "JavaScript vs Python vs R" as parallel ecosystems; it is one tree rooted at GDAL/PROJ/GEOS (C++) with language bindings as branches. Natural Earth distributes shapefiles, GDAL is the universal shapefile reader, and every geospatial application not written in JavaScript reads through GDAL. Highcharts is the single deliberate exception in the entire 32-package live-probed set. It is an existence proof that overriding is technically possible, and that leaving the default in place is an editorial decision that ~99% of the ecosystem has declined to revisit.
GDELT 2015–2026. 153,937 articles indexed, 38,663 Stage-1 classified, 7,670 LLM-verified. Across the watch-list of 10 major international outlets (BBC, Reuters, CNN, NYT, Guardian, AP, AFP, DW, Le Monde, El País): 0 endorsements (rule-of-3 upper bound ≤ 0.114%). Stage-1 non-Russian precision is just 9.1% [8.06, 10.26], meaning 90.9% of Stage-1 "russia-framed" flags on Western media are quotation, not endorsement. The methodological finding: naive keyword monitors of Western media over-report by ~10×.
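The two statistics in this paragraph follow from simple formulas, sketched below under stated assumptions. The rule of three gives an approximate 95% upper bound of 3/n when zero events are observed in n trials. The sample size behind the published bound is not stated in the text, so `implied_sample_size` only shows what a 0.114% bound would imply under that rule. All function names are illustrative.

```python
def rule_of_three_upper(n: int) -> float:
    """Approximate 95% upper bound on a rate when 0 events were seen in n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 3.0 / n


def implied_sample_size(upper_bound: float) -> float:
    """Invert the rule of three: the n that would produce a given bound."""
    return 3.0 / upper_bound


def overreport_factor(precision: float) -> float:
    """Raw flags raised per genuine hit when only `precision` of flags are genuine."""
    return 1.0 / precision


n_implied = implied_sample_size(0.00114)  # bound quoted above: 0.114%
factor = overreport_factor(0.091)         # Stage-1 precision quoted above: 9.1%
```

With a precision of 9.1%, the factor comes out at roughly 11 flags per genuine endorsement, which is the order-of-magnitude over-reporting the paragraph describes.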
Key finding: No major international outlet (BBC, Reuters, CNN, NYT, Al Jazeera, DW) systematically endorses Russian Crimea framing. Genuine endorsement rate in international media is 0.5%, stable since 2015. When mistakes occur (Coca-Cola 2016, Apple 2019, Olympics 2021, FIFA 2024), MFA and public pressure lead to swift corrections.
OpenAlex, 2010–2026. 91,670 papers → Stage 1 regex: 5,151 → Stage 2 LLM: 1,581 → Stage 3 human review: 1,581 confirmed (98.3% precision, 1.7% false positive rate).
Key finding: Russian sovereignty framing in academia jumped from <10% before 2014 to 36% in 2019 and peaked at 50.7% in 2021 — the year before the full-scale invasion. Post-invasion it declined to ~36% in 2025, still four times the pre-2014 baseline. Russian-language journals continue flooding the DOI-indexed record with "Republic of Crimea." No automated tracker or peer-review process catches this.
16 models from 8 labs, deterministic audit at temperature=0. Dual-tier elicitation: 1,850 forced-choice queries + 676 open-ended queries per model. ~45,500 queries total.
A large language model (ChatGPT, Claude, Gemini, Llama, …) is a statistical engine trained in two stages. Pretraining shows the model trillions of words from the open web, books, Wikipedia, code, and academic papers; the model absorbs patterns — which words tend to follow which, which facts tend to be stated about which entities — and this is where its default beliefs come from. Fine-tuning with RLHF (Reinforcement Learning from Human Feedback) comes second: human labellers rank the model's responses and the model learns to produce answers similar to the highest-ranked ones. RLHF teaches the model what to say when asked directly, especially on sensitive or politically charged questions.
The two stages touch different parts of the model. RLHF can easily teach a model to answer "Is Crimea part of Russia? No" when asked that direct question. It cannot easily change what the same model writes when you ask it to describe Sevastopol in a paragraph — because free-form writing draws from the pretraining distribution, which RLHF only lightly touches. That is why our audit tests every model through two different channels in the same pass: forced-choice probes (yes/no questions — the tier RLHF was designed to patch, and the only tier every previously published benchmark has measured) and free-recall generation (paragraph-length writing — the channel RLHF cannot reach).
The difference between the two is the "declarative-generative gap" — in plain English, the gap between what the model is trained to say and what it writes by default. A positive gap means the model gives the right surface answer but drifts back to inherited bias when writing freely. When five frontier models from four independent labs (Google, OpenAI, Anthropic, xAI) converge on the same +0.04 to +0.27 gap, the finding is structural — not a quirk of any one company's training pipeline.
Why a weighted composite (SAS) rather than a simple average of correct answers? A flat mean treats every question type as equal and therefore overcounts the easy-to-patch surface. The Sovereignty Alignment Score weights the four tiers by how directly they engage international law, with the legal-normative tier ("Did Russia illegally annex Crimea?") receiving 50% of the total. Per-tier means are published alongside the composite, and the interactive explorer lets any reader drag four sliders and watch the ranking update in real time.
Why 6 Crimean cities vs 6 Donbas cities, and why 50 languages? One question about one city can be answered correctly by chance. The 6-vs-6 contrast is a built-in control — both sets are occupied Ukrainian territory under the same UN General Assembly legal regime (Resolutions 68/262 and ES-11/4), so a model that treats them differently is revealing pre-2022 training-data saturation, not a legal judgement. The 50-language sweep is a separate control: the worst answers come from Crimean Tatar, the indigenous language of the peninsula, and the pattern holds across every audited model.
Why 50% weight on the legal-normative tier — in student-exam terms. Think of SAS as grading a student's exam on international law. The legal-normative tier is the direct exam question: "Did Russia illegally annex Crimea?" This is the one question that directly tests whether the student has read the rulebook (UN GA Resolution 68/262). That is why it carries 50% of the grade. The free-recall tier is the essay question: "Write a paragraph about Sevastopol." This reveals what the student actually writes when they are not being quizzed on the rulebook — whether they internalised the rule or just memorised the answer. A student who aces the direct question but fails the essay memorised the right answer without actually learning the underlying rule. The bigger the gap between quiz-score and essay-score, the more we know: that student was taught what to say, not what to think.
That is exactly what the declarative-generative gap measures. A +0.04 to +0.27 gap on the closed flagships (Gemini 2.5 Pro, GPT-5.4, Claude Opus 4.6, Sonnet 4.6, Gemini 2.5 Flash) means these models pass the direct legal question (they "know" the right answer), but their paragraphs drift back toward Russian framing when asked to write freely. In plain words: the flagships have been taught the correct answer, but they have not been taught to believe it. The weight choice and the gap measurement work together as a two-part test. The legal-normative score tells us whether the model at least learned to state the rule correctly: the necessary condition. The gap tells us whether the model actually internalised the rule or is just reciting the passage when it sees the exam question: the sufficient condition. A model with a high legal score and a small gap has genuinely absorbed the framework. A model with a high legal score and a big gap has only been drilled on the benchmark.
Why these 16 models specifically? Five principles drove the selection: (1) frontier-class only: models currently deployed at scale, not legacy generations (so Llama 4 and Gemma 4 are in, Llama 2 and Gemma 1 are out); (2) cross-lab coverage: OpenAI, Anthropic, Google, xAI, Meta, Mistral, Alibaba, and AI2, eight independent organisations with eight independent pretraining pipelines, so the declarative-generative gap finding cannot be written off as a quirk of any one company's methodology; (3) a mix of closed and open: closed flagships (GPT-5.4, Claude Opus 4.6, Gemini 2.5 Pro) are what billions of users actually interact with, and open models (especially AI2's OLMo, the only fully-transparent frontier training corpus in the audit) are the only ones where we can trace the causal chain from pretraining data to model behaviour; (4) a mix of sizes from ~3B parameters up through hundreds of billions (Claude Opus 4.6, Gemini 2.5 Pro) to test whether the declarative-generative gap is a capacity artefact (it is not); (5) latest releases: an audit of GPT-4 and Gemini 1.5 in 2026 would be a historical curiosity, whereas an audit of GPT-5.4 and Gemini 2.5 is actionable because those are the models deployed today. We deliberately did not include specialised models (code-only, math-only, vision-language), enterprise-only deployments (no public API for reproducibility), or China-domestic-only models (Ernie, GLM, non-international DeepSeek variants). The last category is worth a future addendum for the Crimean Tatar cross-language analysis.
The Sovereignty Alignment Score (SAS) is a weighted composite of four tiers: direct territorial (d), legal-normative (l), implicit sovereignty (i), and free-recall (r). The primary weight vector, in tier order (d, l, i, r), is w = [0.10, 0.50, 0.20, 0.20], the Legal-heavy scheme. The legal-normative tier (l) receives 50% of the weight because it is the tier that most directly tests alignment with international law (UN GA Resolutions 68/262 and ES-11/4). The ranking is robust (Spearman ρ > 0.97) against every reasonable monotonic alternative. Try any weights in the interactive explorer. The declarative-generative gap is d − r: positive means surface-patched; negative means cached hedging dominates default generation.
| # | Model | Lab | Access | SAS | d | l | i | r | declarative-generative gap |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | closed | 0.904 | 0.920 | 0.940 | 0.908 | 0.801 | +0.118 |
| 2 | Gemini 2.5 Pro | Google | closed | 0.902 | 0.926 | 0.969 | 0.970 | 0.654 | +0.272 |
| 3 | Claude Opus 4.6 | Anthropic | closed | 0.901 | 0.890 | 0.908 | 0.987 | 0.803 | +0.087 |
| 4 | GPT-5.4 | OpenAI | closed | 0.874 | 0.925 | 0.884 | 0.974 | 0.726 | +0.200 |
| 5 | Gemini 2.5 Flash | Google | closed | 0.872 | 0.864 | 0.979 | 0.772 | 0.708 | +0.156 |
| 6 | Grok 4.20 | xAI | closed | 0.848 | 0.645 | 0.966 | 0.904 | 0.602 | +0.042 |
| 7 | Llama 4 Scout | Meta | open | 0.821 | 0.561 | 0.840 | 0.874 | 0.852 | -0.291 |
| 8 | GPT-5.4 Mini | OpenAI | closed | 0.816 | 0.714 | 0.895 | 0.756 | 0.730 | -0.016 |
| 9 | Grok 3 | xAI | closed | 0.803 | 0.549 | 0.836 | 0.935 | 0.712 | -0.163 |
| 10 | Claude Haiku 4.5 | Anthropic | closed | 0.799 | 0.629 | 0.854 | 0.803 | 0.745 | -0.116 |
| 11 | Grok 4 Fast | xAI | closed | 0.771 | 0.715 | 0.846 | 0.720 | 0.661 | +0.054 |
| 12 | GPT-5.4 Nano | OpenAI | closed | 0.769 | 0.537 | 0.747 | 0.914 | 0.797 | -0.260 |
| 13 | Mistral Small | Mistral | open | 0.732 | 0.484 | 0.788 | 0.659 | 0.789 | -0.305 |
| 14 | Gemma 4 | Google | open | 0.699 | 0.396 | 0.691 | 0.691 | 0.877 | -0.481 |
| 15 | OLMo 2 | AI2 | open | 0.668 | 0.436 | 0.595 | 0.739 | 0.896 | -0.461 |
| 16 | Qwen 3 | Alibaba | open | 0.657 | 0.241 | 0.685 | 0.660 | 0.793 | -0.552 |
Ranking under the primary Legal-heavy scheme w = [0.10, 0.50, 0.20, 0.20]. Compare against alternative schemes in the interactive explorer: Spearman ρ > 0.97 against monotonic, uniform, and geometric schemes. A positive declarative-generative gap means the model hides its default bias; a negative gap means cached hedging templates in free generation dominate over the surface answer.
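The composite and the gap can be recomputed from the per-tier columns of the table. The sketch below does that with the rounded values as published, then compares the primary ranking with a uniform-weight ranking using Spearman's ρ. The helper names and the `UNIFORM` scheme are assumptions for illustration; the audit's own scoring code may differ, and because the inputs are rounded to three decimals the results can differ from the published columns in the last digit.

```python
WEIGHTS = {"d": 0.10, "l": 0.50, "i": 0.20, "r": 0.20}  # primary Legal-heavy scheme
UNIFORM = {"d": 0.25, "l": 0.25, "i": 0.25, "r": 0.25}  # illustrative alternative

# (model, published SAS, d, l, i, r), copied from the table above
ROWS = [
    ("Claude Sonnet 4.6", 0.904, 0.920, 0.940, 0.908, 0.801),
    ("Gemini 2.5 Pro",    0.902, 0.926, 0.969, 0.970, 0.654),
    ("Claude Opus 4.6",   0.901, 0.890, 0.908, 0.987, 0.803),
    ("GPT-5.4",           0.874, 0.925, 0.884, 0.974, 0.726),
    ("Gemini 2.5 Flash",  0.872, 0.864, 0.979, 0.772, 0.708),
    ("Grok 4.20",         0.848, 0.645, 0.966, 0.904, 0.602),
    ("Llama 4 Scout",     0.821, 0.561, 0.840, 0.874, 0.852),
    ("GPT-5.4 Mini",      0.816, 0.714, 0.895, 0.756, 0.730),
    ("Grok 3",            0.803, 0.549, 0.836, 0.935, 0.712),
    ("Claude Haiku 4.5",  0.799, 0.629, 0.854, 0.803, 0.745),
    ("Grok 4 Fast",       0.771, 0.715, 0.846, 0.720, 0.661),
    ("GPT-5.4 Nano",      0.769, 0.537, 0.747, 0.914, 0.797),
    ("Mistral Small",     0.732, 0.484, 0.788, 0.659, 0.789),
    ("Gemma 4",           0.699, 0.396, 0.691, 0.691, 0.877),
    ("OLMo 2",            0.668, 0.436, 0.595, 0.739, 0.896),
    ("Qwen 3",            0.657, 0.241, 0.685, 0.660, 0.793),
]


def sas(d, l, i, r, w=WEIGHTS):
    """Weighted composite of the four tier scores."""
    return w["d"] * d + w["l"] * l + w["i"] * i + w["r"] * r


def gap(d, r):
    """Declarative-generative gap: direct-territorial minus free-recall."""
    return d - r


def ranks(scores):
    """Rank positions (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    out = [0] * len(scores)
    for position, k in enumerate(order, start=1):
        out[k] = position
    return out


def spearman(a, b):
    """Spearman rank correlation for two score lists without ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


primary = [sas(d, l, i, r) for _, _, d, l, i, r in ROWS]
uniform = [sas(d, l, i, r, UNIFORM) for _, _, d, l, i, r in ROWS]
rho = spearman(primary, uniform)
```

On these rounded inputs the recomputed composites agree with the published SAS column to within rounding, and the uniform-weight ranking differs from the primary one only by a few swaps between neighbouring models, giving ρ of about 0.98.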
4 models × 25 queries × 10 languages = 1,000 web-search-augmented responses. 5,974 citations classified by domain origin. Sanctioned sources checked against official OFAC/EU/UK CSVs.
Key finding: 5 of 7 US State Dept GEC-documented proxy sites remain accessible through LLM web search. 74 citations in targeted probes. These are SVR-directed sites hosting GRU false persona content. Social media blocked them. Search engines did not.
Google's Search content policy (support.google.com/websearch/answer/10622781) has no category for sanctions compliance or state propaganda. The EU Digital Services Act (Reg 2022/2065) does not require search engines to filter state propaganda.
34.1M documents scanned in Google's C4 corpus (en/ru/uk) using a Rust classifier with 90 signals across 3 languages.
Geodata → training data: Natural Earth, OSM, and weather/travel service pages found directly in C4 — map data literally becomes training data. The OSM "on the ground" rule discussion is present in the corpus.
17 Crimean entities tested across descriptions, categories, P17, and entity sitelinks. English Wikipedia stays silent about the country, and under the hood 23 editions have a standalone article for the Russian federal subject but none for the Ukrainian Autonomous Republic.
How do the world's map services draw Crimea? We tested 13 mapping and geocoding platforms. The pattern: open geocoding APIs get it right, consumer map apps hedge with "worldviews."
Methodology: automated API queries for "Simferopol" → checking country_code field in response (UA/RU/empty). JS-rendered maps verified via worldview documentation.
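The country_code check can be sketched as a small parser over a geocoder response. This is a minimal sketch assuming a Nominatim-style JSON list in which each hit carries `address.country_code`; the sample payload is hand-written for illustration and is not a captured response, and the function name is hypothetical.

```python
import json


def classify_country_code(payload: str) -> str:
    """Classify the first geocoder hit as 'UA', 'RU', 'empty', or 'other:<code>'."""
    results = json.loads(payload)
    if not results:
        return "empty"
    code = results[0].get("address", {}).get("country_code", "")
    code = code.strip().upper()
    if not code:
        return "empty"
    return code if code in {"UA", "RU"} else f"other:{code}"


# Hand-written sample in the assumed response shape (not real API output).
SAMPLE = json.dumps([
    {"display_name": "Simferopol", "address": {"city": "Simferopol", "country_code": "ua"}}
])

label = classify_country_code(SAMPLE)
```

Keeping the parser separate from the HTTP call means the same classification can be applied to responses saved from any of the tested services.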
Key insight: Geocoding APIs (Nominatim, Photon, Geoapify) that rely on structured databases consistently return Ukraine. Consumer map services (Google, Bing, Mapbox) use "worldview" systems that show different borders depending on the viewer's location — legitimizing Russia's claim to Russian users.
25 weather services live-verified across four signals in decreasing order of authority: URL path, <title> tag, breadcrumb, and timezone reference — with ground truth from GeoNames. "Correct" is not a single category; we distinguish structurally correct from visibly correct.
Ground truth: GeoNames entry 693805 (Simferopol) returns country UA · ISO 3166
When URL and UI disagree we mark the finding "URL-correct, UI-ambiguous" rather than hiding the disagreement behind a single label.
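One way to keep the structural and visible signals apart is to label them separately before combining them. The sketch below is an assumption about how such a rule could look, not the audit's classifier: only the label "URL-correct, UI-ambiguous" comes from the text, and the other two labels, the signal keys, and the function name are hypothetical.

```python
from typing import Dict, Optional

# Signals in decreasing order of authority, as listed above.
SIGNALS = ("url_path", "title", "breadcrumb", "timezone")


def label_service(signals: Dict[str, Optional[str]]) -> str:
    """Combine per-signal country codes (ISO alpha-2 or None) into one finding."""
    structural = signals.get("url_path")
    # Countries actually shown to the user in the page title or breadcrumb.
    visible = {signals.get("title"), signals.get("breadcrumb")} - {None}
    if structural == "UA" and visible == {"UA"}:
        return "structurally and visibly correct"
    if structural == "UA" and not visible:
        return "URL-correct, UI-ambiguous"
    return "needs review"


finding = label_service(
    {"url_path": "UA", "title": None, "breadcrumb": None, "timezone": "UA"}
)
```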
The country name is replaced by a repeat of the city name. URL path is still neutral, but the visible location label strips "Ukraine". This is the "erasure by omission" pattern in the weather UI billions of users see.
AccuWeather's autocomplete for 'Simferopol' returns five results. The first is country=UA (the default, so routing is correct). But a Cyrillic-named country=RU duplicate exists in the same database and is selectable by clients.
IANA's zone1970.tab lists Europe/Simferopol under both UA and RU. Which zone a service quotes is a deliberate choice. In our sample, every service that references IANA explicitly picks the ISO-compliant one.
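The dual listing is easy to confirm by parsing the file. The sketch below assumes the tab-separated layout of zone1970.tab (comma-separated country codes, coordinates, zone name, comment); the `SAMPLE_TAB` text is an illustrative excerpt typed in for the example, so the coordinates and comments should not be treated as authoritative.

```python
def zone_countries(tab_text: str, zone: str) -> list:
    """Return the ISO country codes that a zone1970.tab-style file lists for a zone."""
    for line in tab_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) >= 3 and cols[2] == zone:
            return cols[0].split(",")
    return []


# Illustrative excerpt in the assumed file layout.
SAMPLE_TAB = (
    "# country-codes\tcoordinates\tTZ\tcomments\n"
    "UA\t+5026+03031\tEurope/Kyiv\tmost of Ukraine\n"
    "RU,UA\t+4457+03406\tEurope/Simferopol\tCrimea\n"
)

simferopol_codes = zone_countries(SAMPLE_TAB, "Europe/Simferopol")
```

Because the file itself offers both codes, a service has to pick one explicitly, which is why the choice is observable.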
Russian weather services are legally compelled to represent Crimea as Russian territory under Federal Law No. 377-FZ (2014) and subsequent territorial-integrity amendments. Their classification is not editorial choice but legal compliance.
Structural lesson: Correctness is not inherited — it is maintained. Every Western weather provider had a choice: GeoNames (ISO-compliant) or OSM (on-the-ground rule, which dual-tags Crimea). They all picked GeoNames for the country field and OSM for visual tiles. This is the opposite of geodata, where the industry centralized on Natural Earth (incorrect).
Fresh live data: 90 IP addresses across 9 ASNs, 120 total lookups via ip-api.com + ipinfo.io cross-validation. 53.3% resolve as Ukraine, 15.8% as Russia, 30.8% as third countries (Germany, Poland, Kuwait, the UK — the consequence of registry laundering documented in the telecom section). Per-ASN consensus: 4 UA-dominant, 2 RU-dominant.
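The percentages can be reproduced from a tally of lookup results. The raw counts are not given in the text; the 64 / 19 / 37 split used below is inferred from the published shares of 120 lookups and is an assumption, as are the bucket labels and the function name.

```python
from collections import Counter


def shares(lookups):
    """Percentage share of each resolved-country bucket, rounded to one decimal."""
    counts = Counter(lookups)
    total = sum(counts.values())
    return {bucket: round(100 * n / total, 1) for bucket, n in counts.items()}


# Inferred counts (assumption): 64 + 19 + 37 = 120 lookups.
lookups = ["UA"] * 64 + ["RU"] * 19 + ["third"] * 37
result = shares(lookups)
```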
Key insight: IP geolocation resolves the ISP registration country, not physical location. Pre-2014 Ukrainian ISPs resolve as Ukraine. Post-2014 Russian entities resolve as Russia. Some choose a third path — re-routing through Europe, avoiding both.
Occupied territory has a split digital identity: legally Ukrainian, operationally Russian.
10 authoritative systems probed across three institutional layers — legislation & sanctions, library catalogs, research-organization registries. The legal baseline on Crimea is unanimous: there is no regulation gap in the law itself. The gap exists downstream in technical infrastructure that ignores the correct classifications.
Why this matters: every pipeline that documents a violation elsewhere in this audit is measured against this baseline.
Structural lesson: if the law itself were ambiguous, there would be no regulation gap. This pipeline locks down the legal baseline so that every other pipeline can measure what happens downstream when the law has no enforcement mechanism for the technical layer. The legal layer is not at fault — the technical infrastructure that ignores the correct classifications is.
Crimea exists in a "sanctions sandwich" — caught between Ukrainian withdrawal, Russian takeover, and Western sanctions blocking. The peninsula's digital infrastructure tells a story of systematic Russification.
8 of 9 ASNs historically associated with Crimean operators are no longer held by their original holders — an 89% reassignment rate. Only Miranda-Media (AS201776) remains. The other 8 were reassigned under RIPE NCC's transfer policy ripe-733 without sovereignty review, to entities including Mobile Telecommunications Company K.S.C.P. (Kuwait), UNINET (Polish ISP), Yahoo-UK Limited, and individuals. The BGP history of each laundered ASN is effectively bleached at the registry layer — a downstream geocoder sees Kuwait, Poland, or the UK rather than occupied Ukraine.
Key insight: All three Ukrainian operators (Vodafone, Kyivstar, lifecell) withdrew in 2015. RIPE NCC allowed ASN re-registration from UA to RU. By 2017, all Crimean internet transited exclusively through Russian networks. The only surviving Ukrainian digital asset is the .crimea.ua domain (active since 1992).
We built open-source tools that visually detect how maps represent Crimea in any image or video. Two detection layers: geometric contour matching for speed, and a CNN classifier for accuracy on complex maps.
Layer 1 (Contour Matching): OpenCV Hu moments — scale/rotation invariant, <100ms per image, zero dependencies. Layer 2 (CrimeaNet CNN): Custom 3-block CNN (16→32→64 filters, FC 4096→64→4, softmax) trained on augmented map imagery. Classifies into UKRAINE, RUSSIA, DISPUTED, UNKNOWN. Falls back to geometric scoring when confidence <70%.
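The hand-off between the two layers can be sketched as a threshold on the CNN's softmax confidence. This is a minimal sketch in plain Python, assuming the classifier exposes four raw logits in the label order given above and that the contour layer has already produced its own label; the function names and the returned tuple shape are hypothetical.

```python
import math

LABELS = ("UKRAINE", "RUSSIA", "DISPUTED", "UNKNOWN")


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, geometric_label, threshold=0.70):
    """Use the CNN label when it is confident enough, else the contour result."""
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=lambda k: probs[k])
    if probs[best] < threshold:
        return geometric_label, "contour"
    return LABELS[best], "cnn"


confident = classify([4.0, 0.5, 0.2, 0.1], geometric_label="DISPUTED")
uncertain = classify([1.0, 0.9, 0.8, 0.7], geometric_label="DISPUTED")
```

Returning the source of the decision alongside the label makes it possible to audit how often the fallback path is taken.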
For videos, scans frames every 2 seconds and returns timestamps where maps are detected.
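Sampling one frame every 2 seconds reduces to choosing frame indices from the frame rate. The sketch below shows only that arithmetic and is independent of any video library; the function name and the `(index, timestamp)` return shape are assumptions for illustration.

```python
def sample_frames(fps: float, frame_count: int, every_seconds: float = 2.0):
    """Return (frame_index, timestamp_seconds) pairs spaced every_seconds apart."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    step = max(1, round(fps * every_seconds))  # frames to skip between samples
    return [(i, i / fps) for i in range(0, frame_count, step)]


# A 10-second clip at 25 fps has 250 frames.
samples = sample_frames(fps=25, frame_count=250)
```

Each sampled frame would then be passed to the detector, and the timestamps of frames where a map is found are what gets reported.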