Data Methodology

How we discover facilities, extract verified monthly pricing, and decide what to publish.

1. Discovery

Active Canadian self-storage facilities reach the directory through four sources:

Google Places API — nearbysearch + Place Details across major Canadian markets.
Chain sitemaps — direct enumeration of locations from Access Storage, StorageMart, Sentinel, Apple Self Storage, Public Storage Canada, U-Haul, Vaultra, XYZ Storage, Storguard, BigSteelBox, Dymon, SmartStop and others.
Yelp Fusion — cross-reference for reviews and contact details.
Competitor directories — FindStorageFast.ca for entity verification.

Records from multiple sources are entity-resolved by Google Place ID, normalized website URL, coordinate proximity (50 m), and phone-number match. When coordinates are missing, phone-only matches are routed to a Claude judgment step rather than auto-merged.
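As a rough illustration of the matching rules above, the sketch below compares two records. The `Record` fields, the `resolve` function, and the returned labels are hypothetical names, and the way the signals combine is an assumption: any one of Place ID, URL, or 50 m proximity is treated as enough to merge, and a shared phone number with coordinates that are far apart is treated as distinct. The real pipeline may weigh these differently.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import Optional

EARTH_RADIUS_M = 6_371_000
PROXIMITY_M = 50  # coordinate-proximity threshold from the text


@dataclass
class Record:
    place_id: Optional[str] = None
    url: Optional[str] = None    # assumed to be normalized already
    phone: Optional[str] = None  # assumed to be normalized already
    lat: Optional[float] = None
    lon: Optional[float] = None


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def resolve(a: Record, b: Record) -> str:
    """Return 'merge', 'review' (send to the judgment step), or 'distinct'."""
    if a.place_id and a.place_id == b.place_id:
        return "merge"
    if a.url and a.url == b.url:
        return "merge"
    has_coords = None not in (a.lat, a.lon, b.lat, b.lon)
    if has_coords and haversine_m(a.lat, a.lon, b.lat, b.lon) <= PROXIMITY_M:
        return "merge"
    if a.phone and a.phone == b.phone and not has_coords:
        return "review"  # phone-only match without coordinates is not auto-merged
    return "distinct"
```

Returning a label instead of merging in place keeps the ambiguous phone-only case visible to whatever handles the judgment step.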

2. Pricing extraction — the cascade

For each facility, the pipeline tries a layered cascade. The first layer to return plausible monthly pricing wins, and the remaining layers are skipped. Chain-specific layers run first because they're free and exact; the generic LLM extractor catches the long tail.

Chain extractors — direct API or rendered-state calls for SSTG platforms (Access Storage, Sentinel — converted from weekly to monthly), StorageMart (Apollo state), Akiba (Vaultra, Storguard, Diverse — discovered from the embedded <akiba-widget> tag), CubbyStorage (Apple Self Storage), XYZ Storage, U-Haul, Public Storage Canada, and StorEdge rental-center.
Network interception — Playwright captures any JSON XHR responses operator pages fire during render; a recognizer walks them for size/rate pairs. This catches independents on SaaS booking platforms even when we don't have a per-chain extractor.
LLM extraction — for the long tail of independent operators with no recognized API: render the page with Playwright, then send the rendered text to Claude with explicit instructions to capture only standard monthly rates (weekly rates are converted to monthly; promotional rates are excluded) and to return structured size+rate pairs along with a classification of the page behind the facility's URL.
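The first-plausible-result-wins behaviour of the cascade can be sketched as a simple loop. Everything named here (`run_cascade`, `is_plausible`, the `Rates` type) is hypothetical, and the plausibility check is a placeholder; the text does not spell out what "plausible" means beyond the frequency guard described in the next section.

```python
from typing import Callable, Optional

Rates = dict[str, float]  # size label -> monthly rate in dollars
Layer = Callable[[str], Optional[Rates]]


def is_plausible(rates: Optional[Rates]) -> bool:
    """Placeholder check: at least one rate, and all rates positive."""
    return bool(rates) and all(r > 0 for r in rates.values())


def run_cascade(url: str, layers: list[tuple[str, Layer]]) -> Optional[tuple[str, Rates]]:
    """Try layers in order; return (layer name, rates) from the first plausible hit."""
    for name, layer in layers:
        try:
            rates = layer(url)
        except Exception:
            continue  # a failing layer falls through to the next one
        if is_plausible(rates):
            return name, rates  # later layers never run
    return None
```

Ordering the list as chain extractors, then network interception, then LLM extraction means the paid LLM call only happens when the cheaper layers return nothing usable.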

3. Monthly normalization

Different operators publish prices in different units. Access Storage and Sentinel quote per-week rates; CubbyStorage internally uses weekly cents; most others publish monthly. The pipeline enforces a single contract: every stored value is a monthly rate.

Weekly rates are multiplied by 52/12 (≈4.33).
Four-week-cycle rates are multiplied by 13/12 (≈1.08).
A frequency guard runs on every extractor's output: rates that systematically fall below per-size monthly floors but would pass after × 52/12 are rejected as weekly-disguised-as-monthly, and the cascade falls through to the next layer. This prevents silently shipping wrong data.
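A minimal sketch of the conversion factors and the frequency guard follows. The floor values in `MONTHLY_FLOORS` are invented for illustration, and reading "systematically" as "at least 80% of the known sizes" (`min_share`) is an assumption; the function names are hypothetical too.

```python
# (numerator, denominator) of the factor that converts each billing period to monthly
PERIOD_FACTORS = {"monthly": (1, 1), "weekly": (52, 12), "four_week": (13, 12)}

# Hypothetical per-size monthly floors in dollars; the real values are not published
MONTHLY_FLOORS = {"5x5": 40.0, "5x10": 60.0, "10x10": 100.0}


def to_monthly(rate: float, period: str) -> float:
    """Convert a rate quoted per `period` into a monthly rate, rounded to cents."""
    num, den = PERIOD_FACTORS[period]
    return round(rate * num / den, 2)


def looks_weekly(rates: dict[str, float], min_share: float = 0.8) -> bool:
    """True when most rates sit below their floor but would clear it after x 52/12."""
    known = [(size, r) for size, r in rates.items() if size in MONTHLY_FLOORS]
    if not known:
        return False
    suspicious = sum(
        1 for size, r in known
        if r < MONTHLY_FLOORS[size] <= to_monthly(r, "weekly")
    )
    return suspicious / len(known) >= min_share
```

Under these assumed floors, a layer returning `{"5x5": 15.0, "5x10": 22.0, "10x10": 35.0}` as "monthly" would be rejected, since every value only becomes plausible after weekly conversion.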

4. Confidence tiers

HIGH (Verified)

Direct chain-API hit or operator API JSON response (e.g. SSTG _next/data, Akiba units endpoint, StorageMart Apollo state, CubbyStorage API).

MEDIUM (Estimated)

LLM-extracted from rendered page text, or text-scraped from a chain that publishes rates as plain text (PSC categorical, U-Haul Available Units, StorEdge rental-center).

LOW (Approximate)

Categorical 'starting from' copy mapped to canonical sizes (e.g. PSC's small/medium/large). Carries the operator's published number but lower size-resolution than HIGH/MEDIUM.

A fourth internal tier, SIMULATED, marks provincial-average fallback rows that were generated by an earlier enrichment when no real pricing existed. SIMULATED rows are filtered out at the API layer and never reach the public-facing pages.
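The tier labels and the API-layer filter might look something like the following. The `Confidence` enum, the row shape, and `public_rows` are hypothetical; only the tier names and the rule that SIMULATED rows never reach public pages come from the text.

```python
from enum import Enum


class Confidence(Enum):
    HIGH = "Verified"
    MEDIUM = "Estimated"
    LOW = "Approximate"
    SIMULATED = "Simulated"  # internal only; this display label is an assumption


def public_rows(rows: list[dict]) -> list[dict]:
    """Drop SIMULATED rows before anything is returned to public pages."""
    return [r for r in rows if r["confidence"] is not Confidence.SIMULATED]
```

Filtering at a single choke point, rather than in each page template, makes it harder for a fallback row to leak into a new view by accident.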

5. Per-facility classification

Every audited facility gets a classification that explains how the public page should treat it:

pricing_extracted — operator publishes prices online and we successfully scraped them. The facility page shows the pricing table and a Book/View Units CTA that points to the location-specific URL.
no_published_price — operator's page is about this facility but doesn't show prices (quote-only). The pricing table shows "Pricing not yet available" and the external CTA is hidden.
needs_url_review — the URL we have leads to a generic or wrong page. Flagged for manual correction; the live CTA is hidden.
interaction_required — pricing exists but only after a multi-step booking flow we can't complete automatically.
dead_url / wrong_business — domain is dead or now belongs to a different business. These rows are removed from the active set.
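One way to picture how these classifications drive the public page is a lookup table like the sketch below. The `Treatment` fields and the `TREATMENTS` mapping are hypothetical. Where the list above does not say how a page element is handled (the pricing table for needs_url_review, and both elements for interaction_required), the sketch stores `None` instead of guessing.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Classification(Enum):
    PRICING_EXTRACTED = "pricing_extracted"
    NO_PUBLISHED_PRICE = "no_published_price"
    NEEDS_URL_REVIEW = "needs_url_review"
    INTERACTION_REQUIRED = "interaction_required"
    DEAD_URL = "dead_url"
    WRONG_BUSINESS = "wrong_business"


@dataclass(frozen=True)
class Treatment:
    active: bool                 # stays in the active set
    show_prices: Optional[bool]  # None = not specified in the text
    show_cta: Optional[bool]     # None = not specified in the text


TREATMENTS = {
    Classification.PRICING_EXTRACTED: Treatment(True, True, True),
    Classification.NO_PUBLISHED_PRICE: Treatment(True, False, False),
    Classification.NEEDS_URL_REVIEW: Treatment(True, None, False),
    Classification.INTERACTION_REQUIRED: Treatment(True, None, None),
    Classification.DEAD_URL: Treatment(False, False, False),
    Classification.WRONG_BUSINESS: Treatment(False, False, False),
}


def treatment_for(label: str) -> Treatment:
    """Look up the page treatment for a stored classification string."""
    return TREATMENTS[Classification(label)]
```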

6. Update cadence

Chain extractors run weekly and refresh ~40% of the active set in minutes. The LLM cascade re-audits the rest on a rolling cadence keyed off last_audited_at: typically every 14 days, or sooner for facilities flagged for URL review. New facilities appear within 7–14 days of Google Places indexing them.
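The rolling cadence could be selected with a check along these lines. The 14-day default comes from the text; the 7-day interval for facilities flagged for URL review is an assumed stand-in for "sooner", and `is_due` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_INTERVAL = timedelta(days=14)  # typical re-audit interval from the text
REVIEW_INTERVAL = timedelta(days=7)    # assumed value for "sooner"


def is_due(last_audited_at: Optional[datetime], classification: str, now: datetime) -> bool:
    """True when a facility should be re-audited by the LLM cascade."""
    if last_audited_at is None:
        return True  # never audited
    flagged = classification == "needs_url_review"
    interval = REVIEW_INTERVAL if flagged else DEFAULT_INTERVAL
    return now - last_audited_at >= interval
```

Passing `now` in explicitly keeps the check deterministic, which makes the cadence easy to test.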

7. Corrections

Operators can submit corrections through the claim-listing form. Corrections from other users are welcome too: reach out and we'll investigate within 48 hours.

Submit a correction →