Is Your Content Training Chinese AI? — How to Tell. What to Do.

01

Check your logs right now

Pull your access logs and look for the five signs below. If 3 or more match, you're very likely being scraped for AI training data. Probably right now.

1

IP belongs to a datacenter, not a residential ISP

Look for Alibaba, Tencent, Huawei, AWS, Azure ranges. Real users come from Comcast, Vodafone, T-Mobile — not cloud providers.

2

Single GET request to a content page

One request, one page, then the IP is never seen again. A real user browses multiple pages per session. A scraper grabs one and rotates.

3

Zero follow-up requests for assets

No .css, .js, .png or .woff2 loaded. They want your text, not your design. A browser without CSS is a bot without a costume.

4

User-Agent looks like a normal browser

Rotating between Chrome/134, Safari/15, Firefox/135 and even mobile UAs. A datacenter IP claiming to be an iPhone is not an iPhone.

5

No referer header

Most real visitors arrive from Google or social media and carry a referer. Bookmarks and typed URLs don't, so this sign proves nothing on its own. Scrapers hit URLs directly from a list. No search query. No social click. Just raw URL access.

# Quick check: run on your server
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
# Then check top IPs against ipinfo.io (datacenter = red flag)
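Signs 2, 3 and 5 can be checked in one pass. What follows is a minimal sketch, not a finished detector: it assumes nginx's default "combined" log format, and the function name `suspect_ips` and the list of asset extensions are illustrative choices, not part of the setup described here.

```shell
# Prints IPs that show signs 2, 3 and 5 together: exactly one request,
# no static assets fetched, and an empty referer.
# Assumption: nginx "combined" log lines arrive on stdin.
suspect_ips() {
  awk '
    {
      ip = $1; path = $7; ref = $11
      hits[ip]++
      # Sign 3: any asset request marks the IP as browser-like.
      if (path ~ /\.(css|js|png|jpg|svg|ico|woff2?)(\?|$)/) assets[ip] = 1
      # Sign 5: the combined format logs a missing referer as "-".
      if (ref != "\"-\"") referred[ip] = 1
    }
    END {
      for (ip in hits)
        if (hits[ip] == 1 && !(ip in assets) && !(ip in referred))
          print ip
    }
  ' | sort
}

# Usage (path is the same one used in the quick check):
#   suspect_ips < /var/log/nginx/access.log
```

Signs 1 and 4 still need a human eye or an IP lookup: a single hit with no referer can also be a real visitor who typed the URL and left.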

02

The scraping machine behind it

Your content enters a pipeline. Here's what that pipeline looks like.

Your website

Original content
in any language

Blog posts, product descriptions, documentation, research. Anything with unique text is a target.

7 Chinese Cloud Providers

295

unique IPs, each used exactly once.
Automated failover in under 5 minutes.

Alibaba
AS45102 · 35 rules

Tencent
2 ASNs · 110 rules

Huawei
AS136907 · 52 rules

Baidu
AS38365 · 10 rules

ByteDance
AS137718 · 5 rules

Kingsoft
AS137280 · 10 rules

UCloud
AS135377 · 16 rules

EU Proxies
DataCamp + OVH

Chinese LLM Training

Your content
trains their AI

Every major Chinese tech company has an LLM project. They all need multilingual training data.

Alibaba → Qwen · Baidu → ERNIE · Tencent → Hunyuan
ByteDance → Doubao · Huawei → Pangu

The outcome

Their AI
competes with you

For free. Without asking. Using content you spent months creating.

319 CF rules · 161 UFW deny rules · <5 min attacker failover · $0 defense cost
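The ASNs listed in this section can double as a lookup table. A minimal sketch, assuming you feed it the org string that ipinfo.io returns for an IP (something like "AS45102 Alibaba ..."); the function name `is_listed_asn` is hypothetical, and Tencent is left out because its two ASNs are not spelled out here.

```shell
# Succeeds if an org string starts with one of the ASNs listed above.
# Assumption: the string begins with the AS number, as ipinfo.io's org
# field does. Tencent is omitted because its ASNs are not given here.
is_listed_asn() {
  case "${1%% *}" in   # keep only the first word, e.g. "AS45102"
    AS45102|AS136907|AS38365|AS137718|AS137280|AS135377) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (network call; ipinfo.io rate limits apply):
#   is_listed_asn "$(curl -s https://ipinfo.io/47.82.11.197/org)" && echo "listed provider"
```

Matching on the AS number rather than the company name avoids false hits when a provider's name shows up in an unrelated org string.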

03

Evidence from real logs

Cross-Provider Coordination

ByteDance's spider on Alibaba's IP

The IP belongs to Alibaba Cloud (AS45102). The User-Agent identifies ByteDance. A third request from a nearby IP came as Go-http-client/2.0 — same bot, forgot the mask.

47.128.99.229 GET /robots.txt
UA: [email protected]
ASN: Alibaba Cloud (AS45102)

The Death Card

Five puppets. Five costumes. Same card.

Five IPs from the same /24 subnet each grabbed the Death tarot card — different language, different browser fingerprint. One orchestrator exposed by an ironic choice of content.

47.82.11.197 /cards/death Chrome/134
47.82.11.16 /blog/death-meaning Chrome/136
47.82.11.114 /de/cards/death Safari/15.5
47.82.11.15 /it/cards/death Safari/15.5
47.82.11.102 /pt/cards/death Firefox/135
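A cluster like this only shows up once you group by subnet instead of by IP. Here is a rough sketch of that grouping, assuming the client IP is the first field of each log line; the function name `subnet_fleets` and the default threshold of 3 are illustrative, not taken from the setup described here.

```shell
# Lists /24 subnets from which at least MIN_IPS distinct IPv4 addresses
# appeared, busiest first. Reads log lines on stdin.
# Usage: subnet_fleets MIN_IPS < access.log
subnet_fleets() {
  local min_ips="${1:-3}"
  awk -v min="$min_ips" '
    $1 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { next }   # skip IPv6 and junk
    {
      split($1, o, ".")
      net = o[1] "." o[2] "." o[3] ".0/24"
      if (!seen[$1]++) count[net]++   # count each IP once per subnet
    }
    END {
      for (net in count)
        if (count[net] >= min) print count[net], net
    }
  ' | sort -rn
}
```

Fed the five addresses above, it reports a single line: `5 47.82.11.0/24`. Per-IP counters would have shown five unrelated one-off visitors.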

robots.txt Violation

Read the rules. Ignored them.

Tencent checked robots.txt disguised as Chrome. ByteDance checked twice with their real spider UA. Both scraped anyway. They know the etiquette. They don't care.

Tencent robots.txt UA: Chrome/134
ByteDance robots.txt UA: spider@byte..
Result: scraped everything anyway

Beijing Office Hours

Peak: 16:00–19:00 CST (China Standard Time, UTC+8)

Heaviest scraping at end of business day in Beijing. Someone's kicking off batch jobs before heading home. Not random — it's managed.

Peak (UTC): 08:00-11:00
= 16:00-19:00 Beijing time
= end of Chinese workday
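You can look for the same pattern in your own logs with an hourly histogram shifted by eight hours. This is a sketch under two assumptions: the log is in nginx's "combined" format and the server writes timestamps in UTC. The function name `hits_by_beijing_hour` is made up for illustration.

```shell
# Counts requests per hour of the day, shifted from UTC to Beijing time.
# Assumption: "combined" log lines on stdin, timestamps logged in UTC.
hits_by_beijing_hour() {
  awk '
    {
      split($4, t, ":")          # $4 looks like [10/Oct/2025:08:15:42
      hour = (t[2] + 8) % 24     # UTC+8; China has one zone and no DST
      count[hour]++
    }
    END {
      for (h = 0; h < 24; h++)
        if (count[h]) printf "%02d:00 %d\n", h, count[h]
    }
  '
}

# Usage:
#   hits_by_beijing_hour < /var/log/nginx/access.log
```

Run it on scraper traffic only (for example, lines from flagged IPs). Mixed with real visitors, the Beijing peak gets buried under your own audience's daily rhythm.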

Tactic Evolution

Burst blocked? They go stealth.

After burst detection kicked in, Tencent switched to 1 request per IP with 30-60 min gaps, loading full assets to mimic real browsers. 30+ unique IPs, each used once. Invisible to rate limits.

OLD: 100+ req/5min, 0 assets
NEW: 1 req/IP, 30min gaps, CSS+JS loaded
30+ IPs from 43.153.0.0/16, each ONCE

Lateral Domain Move

Blocked on one domain? They probe the others.

After being blocked on the main domain, Tencent probed OTHER sites on the same server. 8 IPs, 4 global regions, identical fake iPhone UA. They map your infrastructure. Fix: account-level CF rules → all 21 domains protected.

129.226.174.80 HK iPhone/iOS 13
43.152.72.244 Beijing iPhone/iOS 13
49.51.36.179 US-West iPhone/iOS 13
Referrer: http:// (not https = fake)

04

The data asymmetry nobody talks about

Chinese LLMs can access

The entire internet

Western web (full, unrestricted scraping)
Chinese domestic content
EU infrastructure (Huawei in 8+ countries)
TikTok/Douyin data (170M US users)

Trained on global data

Western LLMs can access

Half the internet

Western web (own content)
Chinese web (Great Firewall)
WeChat / Douyin (closed ecosystems)
Chinese academic & gov data

Trained on partial data

05

The defense that actually works

Layer 1 — Edge

319

Cloudflare IP Block Rules

IP range blocks at the CDN edge. Full ASN coverage for 7 Chinese cloud providers + China Telecom IDC + EU proxies. Hard BLOCK, not managed challenge — they solve those. Free plan, no limit on rules.

CF Free Plan

Layer 2 — Server

161

UFW Deny Rules

HTTP/HTTPS ports accept traffic ONLY from Cloudflare IP ranges. If they discover your server's real IP, direct connections just time out. Defense in depth.
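One way to build such a rule set is to generate it from Cloudflare's published range list. The sketch below only prints the commands so you can review them before running anything; the function name `ufw_cloudflare_only` is hypothetical, and it assumes one CIDR per line on stdin.

```shell
# Prints (does not run) UFW commands that admit web traffic only from the
# CIDR ranges read on stdin, one per line.
ufw_cloudflare_only() {
  # The "|| [ -n ... ]" part keeps a last line that lacks a newline.
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    [ -n "$cidr" ] || continue   # skip blank lines
    echo "ufw allow proto tcp from $cidr to any port 80,443"
  done
  # Anything not explicitly allowed is dropped silently.
  echo "ufw default deny incoming"
}

# Usage (review the output before piping it to a root shell):
#   curl -s https://www.cloudflare.com/ips-v4 | ufw_cloudflare_only
```

Make sure an allow rule for SSH exists before applying a default deny, or you lock yourself out along with the scrapers. The range list changes occasionally, so regenerate the rules from time to time.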

Built-in firewall

Layer 3 — Detection

5m

Auto-Detection Cron

Bash script runs every 5 minutes. Catches per-IP bursts AND distributed subnet fleets. Auto-blocks via Cloudflare API + sends push notification.
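The per-IP half of such a script can be quite small. What follows is a sketch, not the script described above: the function names, the threshold of 100 and the note text are assumptions, and the Cloudflare endpoint shown in the usage comment (IP Access Rules) should be checked against the current API documentation before you rely on it.

```shell
# Prints IPs with at least THRESHOLD requests in the log lines on stdin.
# The caller picks the window, e.g. by feeding only the lines written
# since the previous run.
burst_ips() {
  local threshold="${1:-100}"
  awk -v t="$threshold" '
    { n[$1]++ }
    END { for (ip in n) if (n[ip] >= t) print ip }
  ' | sort
}

# Builds the JSON body for a Cloudflare IP Access Rule blocking one IP.
cf_block_payload() {
  printf '{"mode":"block","configuration":{"target":"ip","value":"%s"},"notes":"auto-blocked burst"}\n' "$1"
}

# Usage (CF_ZONE_ID, CF_API_TOKEN and recent.log are placeholders):
#   burst_ips 100 < recent.log | while read -r ip; do
#     curl -s -X POST \
#       "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/firewall/access_rules/rules" \
#       -H "Authorization: Bearer $CF_API_TOKEN" \
#       -H "Content-Type: application/json" \
#       --data "$(cf_block_payload "$ip")"
#   done
#
# Cron entry for a 5-minute cadence (script path is hypothetical):
#   */5 * * * * /usr/local/bin/scrape-watch.sh
```

This catches bursts only. Distributed fleets that use each IP once stay under any per-IP threshold, which is why the subnet-level check matters as a second pass.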

Custom script