FAQ
Frequently asked questions about Scrape Exchange.
Why?
Many companies and researchers are already scraping public data. There is an ongoing tussle between platforms and scrapers, with platforms trying to block scrapers and scrapers trying to evade blocks. Scrape Exchange provides a neutral ground for sharing scraped metadata, reducing redundant scraping efforts, lowering traffic to platforms, and enabling data reuse.
How is data verified?
Every upload requires a source URL. We may spot-check random uploads against live sources, and if we find discrepancies we take appropriate action, which may include suspending an account. Downloaders should treat the data as provided "as is" and verify it against live sources as needed.
Data retention policy?
Data is stored indefinitely.
Can I upload media files?
No, we only accept metadata. Storing media files would significantly increase costs and create legal exposure.
Is this legal?
Scraping public or user-generated data is generally legal, but users scraping a website are responsible for complying with platform terms of service and local laws.
Can I get live updates?
Yes, we offer a WebSocket feed for real-time notifications of new uploads, either with or without the content metadata. For an example, check out the listener tool in the scrape-python repository.
How can I report an issue or request a feature?
Is it free?
Yes, the platform is completely free to use. There are no fees for uploading or downloading data, and we have no plans to introduce paid features. Our mission is to democratize access to social media metadata, and we believe keeping the platform free is essential to achieving that goal. We may start pestering you about donations in the future to help cover costs, but donating will be completely optional and will not affect your access to the platform in any way.
What is the directory layout in the data dumps?
The directory structure is first organized by the schema the data refers to, then the username of the uploader, and finally the platform_creator_id or platform_topic_id. In that final directory are all the files meeting those parameters. This allows you to easily select what files you want to download from the torrent.
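For illustration, here is a small helper that splits a dump path back into those three levels (the concrete schema name, username, and file name below are made up):

```python
from pathlib import PurePosixPath


def parse_dump_path(path: str) -> dict:
    """Split a data-dump path into its components.

    Layout (per the FAQ): {schema}/{uploader_username}/
    {platform_creator_id or platform_topic_id}/{file}
    """
    schema, uploader, entity_id, filename = PurePosixPath(path).parts
    return {
        "schema": schema,
        "uploader": uploader,
        "entity_id": entity_id,
        "file": filename,
    }


# Example with hypothetical names:
info = parse_dump_path("acme_youtube_channel_1.0.0/alice/UC123/2024-01.json.br")
```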
Why is the website slow or down?
The Scrape Exchange setup is designed to be cheap and still scalable. If you scrape and upload data, have your tooling save the scraped data to a file first and then upload it. If the upload fails, your tool can retry later using the saved data. For downloading, use the torrent files: as long as enough people are seeding the torrents, the data stays available even when the website is not.
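One way to follow the save-then-retry advice is to spool records to disk and delete each file only after a successful upload, so a failed attempt simply leaves the file for the next run. A minimal sketch (upload_fn is a stand-in for your actual upload call):

```python
import json
from pathlib import Path


def save_for_upload(record: dict, spool_dir: Path, name: str) -> Path:
    """Write a scraped record to the spool directory before uploading."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    path = spool_dir / f"{name}.json"
    path.write_text(json.dumps(record))
    return path


def flush_spool(spool_dir: Path, upload_fn) -> int:
    """Try to upload every spooled file; keep files whose upload fails."""
    uploaded = 0
    for path in sorted(spool_dir.glob("*.json")):
        record = json.loads(path.read_text())
        try:
            upload_fn(record)
        except Exception:
            continue  # leave the file in place for a later retry
        path.unlink()
        uploaded += 1
    return uploaded
```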
Why won't curl download a data file?
All data is stored in separate files and compressed with Brotli. You can install brotli with sudo apt install brotli on Debian-based systems, sudo yum install brotli on Red Hat-based systems, or brew install brotli on macOS. For downloading you have a couple of options:
- Specify the Accept-Encoding: br header in your curl request to get the compressed file, then decompress it with brotli.
- Add the .br extension to the URL to get the compressed file directly, e.g. curl https://scrape.exchange/data/12345.br -o data.json.br, then decompress it.
Why do I need to download each data file separately?
We prefer you download archives via torrents rather than individual files. This allows us to keep costs low and make the platform sustainable. Storing data in separate files also means we can serve downloads directly from object storage (e.g., AWS S3) without needing to spin up expensive servers to handle large file generation on the fly.
How do I upload data?
You can upload data programmatically via the API. Sign up for a free account to get API keys. For examples in Python, check out the tools in the scrape-python repository. Whenever possible, use one of the existing JSON schemas — it reduces complexity for people downloading the data you contributed. If you are scraping data not yet covered by an existing schema, you can upload new JSON Schemas.
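To give a feel for what an upload call involves, the sketch below assembles (but does not send) a request with the standard library. The endpoint path and field names are illustrative assumptions; consult the API docs and the scrape-python tools for the real ones:

```python
import json
import urllib.request


def build_upload_request(api_key: str, schema: str, source_url: str,
                         data: dict) -> urllib.request.Request:
    """Assemble an upload request (endpoint and field names are guesses)."""
    body = json.dumps({
        "schema": schema,          # e.g. an existing schema name
        "source_url": source_url,  # required for every upload
        "data": data,
    }).encode()
    return urllib.request.Request(
        "https://scrape.exchange/api/upload",  # hypothetical endpoint
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_upload_request("my-key", "acme_youtube_channel_1.0.0",
                           "https://example.com/source", {"title": "demo"})
# Sending would then be: urllib.request.urlopen(req)
```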
Why is my schema rejected by the API?
Schemas must meet the following requirements:
- Written against the JSON Schema specification, draft 2020-12
- Describes an entity type on one of the supported platforms (e.g. YouTube channel, TikTok video, Instagram comment)
- Self-contained: no external $refs; URLs may only point to the jsonschema.org or scrape.exchange domains
- Versioned with semantic versioning (semver), following the naming convention {schema_owner}_{platform}_{entity_type}_{semver}
- Valid according to the JSON Schema specification and passing our validation checks (no circular references, within size limits, etc.)
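The naming convention can be checked with a simple pattern. This sketch assumes each of schema_owner, platform, and entity_type is a single lowercase alphanumeric token (the FAQ does not spell out the allowed characters) and that semver means MAJOR.MINOR.PATCH:

```python
import re

# Assumed: owner/platform/entity_type contain no underscores themselves,
# so the four parts of the name can be split unambiguously.
SCHEMA_NAME = re.compile(
    r"^(?P<owner>[a-z0-9]+)_(?P<platform>[a-z0-9]+)_"
    r"(?P<entity>[a-z0-9]+)_(?P<semver>\d+\.\d+\.\d+)$"
)


def is_valid_schema_name(name: str) -> bool:
    """Check a name against {schema_owner}_{platform}_{entity_type}_{semver}."""
    return SCHEMA_NAME.fullmatch(name) is not None
```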