Glossary · technical
Common Crawl (CCBot)
Non-profit project (founded 2007) that operates a large-scale web crawl and publishes the resulting data, which feeds many LLM training corpora (the training data for GPT-3, GPT-4, Llama, and Claude has reportedly included Common Crawl-derived datasets). Common Crawl's crawler identifies itself with the user-agent token CCBot and honors robots.txt. Sites whose robots.txt allows CCBot, rather than disallowing it, stay eligible for the crawl and so tend to be better represented in downstream LLM training data.
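To check how a given robots.txt treats CCBot, the rules can be evaluated locally. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the robots.txt contents, the `example.com` URLs, and the helper name `ccbot_allowed` are illustrative assumptions, not part of any Common Crawl tooling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that allows CCBot everywhere while
# keeping /private/ off-limits to all other crawlers.
WELCOMING_ROBOTS_TXT = """\
User-agent: CCBot
Allow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical robots.txt that blocks CCBot entirely.
BLOCKING_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets CCBot fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    print(ccbot_allowed(WELCOMING_ROBOTS_TXT, "https://example.com/articles/1"))
    print(ccbot_allowed(BLOCKING_ROBOTS_TXT, "https://example.com/articles/1"))
```

Note that a robots.txt with no CCBot-specific group falls back to the `User-agent: *` rules, so an explicit `Allow` is only needed when the general rules would otherwise block the crawler.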
Full glossary index (70)
All terms in the Step Secrets Editorial Glossary. Each is a standalone reference page.
- AggregateRating
- Aylo
- C2PA
- compliance-as-a-service
- creator platform
- deepfake
- edge cache
- gonzo
- IndexNow
- KYC
- MindGeek
- NCII
- partner program
- pcombo
- performer-led brand
- RTA
- scenario-led production
- specialty merchant
- sitemap
- studio-tier
- §2257
- tube site
- XBIZ
- OnlyFans
- Fansly
- PPV
- partner-program traffic
- creator economy
- CFR Part 75
- primary producer
- secondary producer
- Free Speech Coalition
- age assurance
- AVN Awards
- XBIZ Awards
- ARCOM
- Online Safety Act
- JuSchG
- feature
- CPM
- CDN
- KJM
- studio network
- compliance documentation
- scene library
- distribution stack
- verified bot
- AVN
- AEE
- Bing Webmaster Tools
- GSC
- Schema.org
- JSON-LD
- rich snippet
- helpful content update
- AI Overview
- Hidden Gems
- pay-site
- Mojeek
- BoodiGo
- Yandex Webmaster
- Pornhub
- XVideos
- WGCZ Holding
- YouPorn
- IndexNow protocol
- Naver
- Seznam
- AdsBot-Google