FAQ
Frequently asked questions about Scrape Exchange.
Why?
Many companies and researchers are already scraping public data. There is an ongoing tussle between platforms and scrapers, with platforms trying to block scrapers and scrapers trying to evade blocks. Scrape Exchange provides a neutral ground for sharing scraped metadata, reducing redundant scraping efforts, lowering traffic to platforms, and enabling data reuse.
How is data verified?
Every upload requires a source URL. We may spot-check random uploads against live sources, and if we find discrepancies we take appropriate action, which may include suspending an account. Downloaders should treat the data as provided "as is" and verify it against live sources as needed.
Data retention policy?
Data is stored indefinitely.
Can I upload media files?
No, we only accept metadata. Storing media files would significantly increase costs and create legal exposure.
Is this legal?
Scraping public or user-generated data is generally legal, but users scraping a website are responsible for complying with platform terms of service and local laws.
Can I get live updates?
Yes, we offer a WebSocket feed for real-time notifications of new uploads, either with or without the content metadata. For an example, check out the listener tool in the scrape-python repository.
How can I report an issue or request a feature?
Is it free?
Yes, the platform is completely free to use. There are no fees for uploading or downloading data, and we have no plans to introduce paid features. Our mission is to democratize access to social media metadata, and we believe keeping the platform free is essential to achieving that goal. We may start pestering you about donations in the future to help cover costs, but donating will be completely optional and will not affect your access to the platform in any way.
What is the directory layout in the data dumps?
The directory structure is first organized by the schema the data refers to, then the username of the uploader, and finally the platform_creator_id or platform_topic_id. In that final directory are all the files meeting those parameters. This allows you to easily select what files you want to download from the torrent.
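For illustration, here is a small helper that splits a dump path back into those three levels (the concrete schema name, username, and file name below are made up):

```python
from pathlib import PurePosixPath


def parse_dump_path(path: str) -> dict:
    """Split a data-dump path into its components.

    Layout (per the FAQ): {schema}/{uploader_username}/
    {platform_creator_id or platform_topic_id}/{file}
    """
    schema, uploader, entity_id, filename = PurePosixPath(path).parts
    return {
        "schema": schema,
        "uploader": uploader,
        "entity_id": entity_id,
        "file": filename,
    }


# Example with hypothetical names:
info = parse_dump_path("acme_youtube_channel_1.0.0/alice/UC123/2024-01.json.br")
```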
Why is the website slow or down?
The Scrape Exchange setup is designed to be cheap and still scalable. If you scrape and upload data, have your tooling save the scraped data to a file first and then upload it. If the upload fails, your tool can retry later using the saved data. For downloading, use the torrent files: as long as enough people are seeding the torrents, the data stays available even when the website is not.
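One way to follow the save-then-retry advice is to spool records to disk and delete each file only after a successful upload, so a failed attempt simply leaves the file for the next run. A minimal sketch (upload_fn is a stand-in for your actual upload call):

```python
import json
from pathlib import Path


def save_for_upload(record: dict, spool_dir: Path, name: str) -> Path:
    """Write a scraped record to the spool directory before uploading."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    path = spool_dir / f"{name}.json"
    path.write_text(json.dumps(record))
    return path


def flush_spool(spool_dir: Path, upload_fn) -> int:
    """Try to upload every spooled file; keep files whose upload fails."""
    uploaded = 0
    for path in sorted(spool_dir.glob("*.json")):
        record = json.loads(path.read_text())
        try:
            upload_fn(record)
        except Exception:
            continue  # leave the file in place for a later retry
        path.unlink()
        uploaded += 1
    return uploaded
```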
Why won't curl download a data file?
All data is stored in separate files and compressed with Brotli. You can install brotli with sudo apt install brotli on Debian-based systems, sudo yum install brotli on Red Hat-based systems, or brew install brotli on macOS. For downloading you have a couple of options:
- Specify the Accept-Encoding: br header in your curl request to get the compressed file, then decompress it with brotli.
- Add the .br extension to the URL to get the compressed file directly, e.g. curl https://scrape.exchange/data/12345.br -o data.json.br, then decompress it.
Why do I need to download each data file separately?
We prefer you download archives via torrents rather than individual files. This allows us to keep costs low and make the platform sustainable. Storing data in separate files also means we can serve downloads directly from object storage (e.g., AWS S3) without needing to spin up expensive servers to handle large file generation on the fly.
How do I upload data?
You can upload data programmatically via the API. Sign up for a free account to get API keys. For examples in Python, check out the tools in the scrape-python repository. Whenever possible, use one of the existing JSON schemas — it reduces complexity for people downloading the data you contributed. If you are scraping data not yet covered by an existing schema, you can upload new JSON Schemas.
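To give a feel for what an upload call involves, the sketch below assembles (but does not send) a request with the standard library. The endpoint path and field names are illustrative assumptions; consult the API docs and the scrape-python tools for the real ones:

```python
import json
import urllib.request


def build_upload_request(api_key: str, schema: str, source_url: str,
                         data: dict) -> urllib.request.Request:
    """Assemble an upload request (endpoint and field names are guesses)."""
    body = json.dumps({
        "schema": schema,          # e.g. an existing schema name
        "source_url": source_url,  # required for every upload
        "data": data,
    }).encode()
    return urllib.request.Request(
        "https://scrape.exchange/api/upload",  # hypothetical endpoint
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_upload_request("my-key", "acme_youtube_channel_1.0.0",
                           "https://example.com/source", {"title": "demo"})
# Sending would then be: urllib.request.urlopen(req)
```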
Why is my schema rejected by the API?
Schemas must meet the following requirements:
- Written against the JSON Schema specification, draft 2020-12
- Describes an entity type on one of the supported platforms (e.g. YouTube channel, TikTok video, Instagram comment)
- Self-contained: no external $refs; URLs may only point to the jsonschema.org or scrape.exchange domains
- Versioned with semantic versioning (semver), following the naming convention {schema_owner}_{platform}_{entity_type}_{semver}
- Valid according to the JSON Schema specification and passing our validation checks (no circular references, within size limits, etc.)
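The naming convention can be checked with a simple pattern. This sketch assumes each of schema_owner, platform, and entity_type is a single lowercase alphanumeric token (the FAQ does not spell out the allowed characters) and that semver means MAJOR.MINOR.PATCH:

```python
import re

# Assumed: owner/platform/entity_type contain no underscores themselves,
# so the four parts of the name can be split unambiguously.
SCHEMA_NAME = re.compile(
    r"^(?P<owner>[a-z0-9]+)_(?P<platform>[a-z0-9]+)_"
    r"(?P<entity>[a-z0-9]+)_(?P<semver>\d+\.\d+\.\d+)$"
)


def is_valid_schema_name(name: str) -> bool:
    """Check a name against {schema_owner}_{platform}_{entity_type}_{semver}."""
    return SCHEMA_NAME.fullmatch(name) is not None
```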