This is the site they were scraping · aimag.me · Free AI tarot readings
Active Threat

Is your content
training Chinese AI?

How to tell. What to do. Costs you nothing.

01

Check your logs right now

Pull your access logs and look for these five signs. If three or more of them match the same traffic, you're being scraped for AI training data. Probably right now.

1
IP belongs to a datacenter, not a residential ISP
Look for Alibaba, Tencent, Huawei, AWS, Azure ranges. Real users come from Comcast, Vodafone, T-Mobile — not cloud providers.
2
Single GET request to a content page
One request, one page, then the IP is never seen again. A real user browses multiple pages per session. A scraper grabs one and rotates.
3
Zero follow-up requests for assets
No .css .js .png .woff2 loaded. They want your text, not your design. A browser without CSS is a bot without a costume.
4
User-Agent looks like a normal browser
Rotating between Chrome/134, Safari/15, Firefox/135 and even mobile UAs. A datacenter IP claiming to be an iPhone is not an iPhone.
5
No referer header
Most real visitors arrive from Google or social media and carry a referer; only bookmarks and typed URLs arrive without one. Scrapers hit URLs directly from a list. No search query. No social click. Just raw URL access.
# Quick check: run on your server
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
# Then check top IPs against ipinfo.io: datacenter = red flag
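Signs 2, 3 and 5 can be read straight from the log without any lookup. The following is a minimal sketch, assuming nginx's default "combined" log format; the function name `single_shot_ips` and the asset extension list are illustrative choices, not part of the original setup. Sign 1 (datacenter IP) still needs an external lookup such as ipinfo.io.

```shell
# single_shot_ips [FILE...]: print IPs that made exactly one request,
# loaded no static assets, and sent no referer (signs 2, 3 and 5).
# Assumes the nginx "combined" format, split here on double quotes:
#   $1 = 'IP - - [time] '   $2 = 'METHOD PATH PROTO'   $4 = referer
single_shot_ips() {
  awk -F'"' '
    {
      split($1, pre, " "); ip = pre[1]       # client IP is the first token
      split($2, req, " "); path = req[2]     # request line: METHOD PATH PROTO
      sub(/[?].*/, "", path)                 # drop any query string
      hits[ip]++
      # Sign 3: a static asset request makes the IP look like a real browser
      if (path ~ /\.(css|js|png|jpg|gif|svg|ico|woff2)$/) assets[ip] = 1
      # Sign 5: nginx logs "-" when no referer header was sent
      if ($4 != "-" && $4 != "") referred[ip] = 1
    }
    END {
      for (ip in hits)
        if (hits[ip] == 1 && !(ip in assets) && !(ip in referred)) print ip
    }
  ' "$@" | sort
}
```

Possible usage: `single_shot_ips /var/log/nginx/access.log | head -50`, then check the resulting IPs against ipinfo.io for sign 1.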
02

The scraping machine behind it

Your content enters a pipeline. Here's what that pipeline looks like.

Your website
Original content
in any language
Blog posts, product descriptions, documentation, research. Anything with unique text is a target.
7 Chinese Cloud Providers
295
unique IPs, each used exactly once.
Automated failover in under 5 minutes.
Alibaba · AS45102 · 35 rules
Tencent · 2 ASNs · 110 rules
Huawei · AS136907 · 52 rules
Baidu · AS38365 · 10 rules
ByteDance · AS137718 · 5 rules
Kingsoft · AS137280 · 10 rules
UCloud · AS135377 · 16 rules
EU Proxies · DataCamp + OVH
Chinese LLM Training
Your content
trains their AI
Every major Chinese tech company has an LLM project. They all need multilingual training data.
Alibaba → Qwen  ·  Baidu → ERNIE  ·  Tencent → Hunyuan
ByteDance → Doubao  ·  Huawei → Pangu
The outcome
Their AI
competes with you
For free. Without asking. Using content you spent months creating.
319 CF rules · 161 UFW deny rules · <5 min attacker failover · $0 defense cost
03

Evidence from real logs

Cross-Provider Coordination
ByteDance's spider on Alibaba's IP
The IP belongs to Alibaba Cloud (AS45102). The User-Agent identifies ByteDance. A third request from a nearby IP came as Go-http-client/2.0 — same bot, forgot the mask.
47.128.99.229  GET /robots.txt
UA:  [email protected]
ASN: Alibaba Cloud (AS45102)
The Death Card
Five puppets. Five costumes. Same card.
Five IPs from the same /24 subnet each grabbed the Death tarot card — different language, different browser fingerprint. One orchestrator exposed by an ironic choice of content.
47.82.11.197  /cards/death         Chrome/134
47.82.11.16   /blog/death-meaning  Chrome/136
47.82.11.114  /de/cards/death      Safari/15.5
47.82.11.15   /it/cards/death      Safari/15.5
47.82.11.102  /pt/cards/death      Firefox/135
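A fleet like this one is invisible per IP but obvious per subnet. As a rough sketch (the function name `subnet_fleets` and the threshold argument are illustrative, not the original detection script), the log can be grouped by /24 and filtered for subnets where several different IPs each appear exactly once:

```shell
# subnet_fleets MIN [FILE...]: list /24 subnets in which at least MIN
# distinct IPv4 addresses each made exactly one request.
# Assumes the client IP is the first whitespace-separated field.
subnet_fleets() {
  fleet_min="$1"; shift
  awk -v min="$fleet_min" '
    { hits[$1]++ }
    END {
      for (ip in hits) {
        if (hits[ip] != 1) continue          # only single-use IPs count
        split(ip, o, ".")
        if (o[4] == "") continue             # skip anything that is not IPv4
        fleet[o[1] "." o[2] "." o[3] ".0/24"]++
      }
      for (net in fleet)
        if (fleet[net] >= min) print fleet[net], net
    }
  ' "$@" | sort -rn
}
```

Possible usage: `subnet_fleets 5 /var/log/nginx/access.log`. Each output line is the count of single-use IPs followed by the subnet. Widening the grouping to /16 would only need the third octet dropped from the key.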
robots.txt Violation
Read the rules. Ignored them.
Tencent checked robots.txt disguised as Chrome. ByteDance checked twice with their real spider UA. Both scraped anyway. They know the etiquette. They don't care.
Tencent    robots.txt  UA: Chrome/134
ByteDance  robots.txt  UA: spider@byte..
Result: scraped everything anyway
Beijing Office Hours
Peak: 16:00–19:00 CST (China Standard Time, UTC+8)
Heaviest scraping at end of business day in Beijing. Someone's kicking off batch jobs before heading home. Not random — it's managed.
Peak (UTC): 08:00-11:00
= 16:00-19:00 Beijing time
= end of Chinese workday
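The same check can be run on any log by bucketing requests per hour and shifting the hour by eight. This is a small sketch, assuming the log timestamps are written in UTC (`+0000`) and sit in the fourth whitespace-separated field, as in nginx's default format; `beijing_hours` is an illustrative name.

```shell
# beijing_hours [FILE...]: request count per hour of day, in Beijing time.
# Assumes timestamps like [10/Mar/2025:08:01:02 +0000] logged in UTC.
beijing_hours() {
  awk '
    {
      split($4, t, ":")            # t[2] is the UTC hour, e.g. "08"
      bj = (t[2] + 8) % 24         # China Standard Time is UTC+8, no DST
      count[bj]++
    }
    END {
      for (h = 0; h < 24; h++)
        if (h in count) printf "%02d:00 %d\n", h, count[h]
    }
  ' "$@"
}
```

Feeding it only the suspect traffic (for example, lines from datacenter IPs) gives a cleaner picture than the whole log, since real visitors follow their own daily rhythm.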
Tactic Evolution
Burst blocked? They go stealth.
After burst detection kicked in, Tencent switched to 1 request per IP with 30-60 min gaps, loading full assets to mimic real browsers. 30+ unique IPs, each used once. Invisible to rate limits.
OLD: 100+ req/5min, 0 assets
NEW: 1 req/IP, 30min gaps, CSS+JS loaded
30+ IPs from 43.153.0.0/16, each ONCE
Lateral Domain Move
Blocked on one domain? They probe the others.
After we blocked them on the main domain, Tencent probed the other sites on the same server: 8 IPs, 4 global regions, identical fake iPhone UA. They map your infrastructure. Fix: account-level CF rules → all 21 domains protected.
129.226.174.80  HK       iPhone/iOS 13
43.152.72.244   Beijing  iPhone/iOS 13
49.51.36.179    US-West  iPhone/iOS 13
Referrer: http:// (not https = fake)
04

The data asymmetry nobody talks about

Chinese LLMs can access
The entire internet
  • Western web (full, unrestricted scraping)
  • Chinese domestic content
  • EU infrastructure (Huawei in 8+ countries)
  • TikTok/Douyin data (170M US users)
Trained on global data
Western LLMs can access
Half the internet
  • Accessible: Western web (own content)
  • Blocked: Chinese web (Great Firewall)
  • Blocked: WeChat / Douyin (closed ecosystems)
  • Blocked: Chinese academic & gov data
Trained on partial data
05

The defense that actually works

Layer 1 — Edge
319
Cloudflare IP Block Rules
IP range blocks at the CDN edge. Full ASN coverage for 7 Chinese cloud providers + China Telecom IDC + EU proxies. Hard BLOCK, not managed challenge — they solve those. Free plan, no limit on rules.
CF Free Plan
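One way to turn an ASN range list into edge rules is sketched below. It rests on assumptions that should be checked against current documentation: that the range file is plain text with `#` comment lines followed by one CIDR per line, that Cloudflare's v4 API accepts IP Access Rules at the account-level `firewall/access_rules/rules` endpoint with an `ip_range` target, and that Cloudflare may reject some prefix lengths. The names `block_payloads`, `push_rule`, `CF_ACCOUNT_ID` and `CF_API_TOKEN` are hypothetical.

```shell
# block_payloads NOTE: read CIDRs on stdin, skip comments and blank lines,
# and print one JSON rule body per CIDR. Nothing is sent anywhere.
block_payloads() {
  awk -v note="$1" '
    /^[ \t]*#/ || /^[ \t]*$/ { next }
    {
      printf "{\"mode\":\"block\",\"configuration\":{\"target\":\"ip_range\",\"value\":\"%s\"},\"notes\":\"%s\"}\n", $1, note
    }
  '
}

# push_rule JSON: send one rule body to Cloudflare (assumed endpoint).
# Needs CF_ACCOUNT_ID and CF_API_TOKEN in the environment.
push_rule() {
  curl -sS -X POST \
    "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/firewall/access_rules/rules" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" \
    -H "Content-Type: application/json" \
    --data "$1"
}
```

Possible usage, with a hypothetical local file `as45102.txt` holding the Alibaba ranges: `block_payloads AS45102 < as45102.txt | while read -r body; do push_rule "$body"; done`. Reviewing the generated payloads before pushing them is the safer order, since a wrong range blocks real visitors.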
Layer 2 — Server
161
UFW Deny Rules
HTTP/HTTPS ports accept traffic ONLY from Cloudflare IP ranges. If they discover your server's real IP, direct connections just time out. Defense in depth.
Built-in firewall
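The allow-only-Cloudflare idea can be sketched as a generator that prints UFW commands instead of running them, so the rule set can be reviewed first. This is an illustration under assumptions, not the original 161-rule configuration: `ufw_allow_cmds` is a made-up name, and the input is assumed to be one CIDR per line, which is how Cloudflare publishes its ranges at https://www.cloudflare.com/ips-v4 and https://www.cloudflare.com/ips-v6.

```shell
# ufw_allow_cmds: read Cloudflare CIDRs on stdin and print the ufw commands
# that allow web traffic from those ranges only. Commands are printed, not run.
ufw_allow_cmds() {
  awk '
    /^[ \t]*$/ { next }                      # ignore blank lines
    { print "ufw allow proto tcp from " $1 " to any port 80,443" }
    END {
      # UFW matches rules in order, so this catch-all deny must come last
      print "ufw deny proto tcp from any to any port 80,443"
    }
  '
}
```

Possible usage: `curl -fsS https://www.cloudflare.com/ips-v4 | ufw_allow_cmds`, then run the printed commands as root once they look right. Any older blanket rule such as `ufw allow 80/tcp` has to be deleted first, otherwise it matches before the deny.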
Layer 3 — Detection
5m
Auto-Detection Cron
Bash script runs every 5 minutes. Catches per-IP bursts AND distributed subnet fleets. Auto-blocks via Cloudflare API + sends push notification.
Custom script
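The per-IP burst half of such a script could look roughly like this. It is a simplified sketch, not the original cron job: it uses fixed 5-minute buckets rather than a sliding window, assumes nginx's default timestamp in the fourth field, and the name `burst_ips` and the threshold argument are illustrative.

```shell
# burst_ips LIMIT [FILE...]: print IPs that made at least LIMIT requests
# inside any single fixed 5-minute bucket.
# Assumes timestamps like [10/Mar/2025:08:01:02 +0000] in field 4.
burst_ips() {
  burst_limit="$1"; shift
  awk -v limit="$burst_limit" '
    {
      split($4, t, ":")                          # date, hour, minute, second
      slot = t[1] ":" t[2] ":" int(t[3] / 5)     # 5-minute bucket id
      if (++n[$1 " " slot] == limit) print $1    # report once per bucket
    }
  ' "$@" | sort -u
}
```

Possible usage from cron, matching the 100+ req/5min pattern seen in the logs: `*/5 * * * * /usr/local/bin/detect.sh`, where the hypothetical `detect.sh` runs `burst_ips 100 /var/log/nginx/access.log` and hands each result to the blocking step. A burst that straddles a bucket boundary can slip under the limit, which is the price of keeping the script this short.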
Go check your logs.

Most site owners never look. That's what they're counting on. The IP range lists you need to build blocklists, updated daily:

github.com/ipverse/asn-ip
Source: server access logs, ipinfo.io, ipverse/asn-ip
Defense: Cloudflare Free + UFW + custom bash script · Total cost: $0