CloudFlare is a wonderful thing. An excellent DNS provider coupled with a hosted "intelligent" caching reverse proxy? Sign me up!
Moreover, some of the functionality offered up by CloudFlare can only be described as awesome: from the ability to project a WAF over your entire site, to automatically minifying all static assets, to even offering a proprietary ESI alternative in the form of Railgun.
CloudFlare has powered my site, that of StormBit, the Pomf.se hosting service and countless others that I've been involved with over the past few years, all without a single hitch. Alongside saving us bandwidth and offering lightning-fast DNS updates, unlimited DNS entries and flexible SSL, it has also kept sites serving content even when they have been down or under heavy load. With CloudFlare I can honestly say that small websites finally have a way of ensuring they don't suffer the Slashdot effect.
However, this is not intended to be an advert for CloudFlare. Rather, I have been hit with their "I'm under attack" mode, which is all well and good when one is actually under attack, as it almost completely negates the ability of bots to hit applications and make away with precious bandwidth. Unfortunately I'm not quite a bot, and instead of trying to steal bandwidth or harvest email addresses, I'm trying to scrape information (for playlists) from websites, which is at least a tiny bit less insidious.
Looking over how CloudFlare decides whether or not I am a bot, it evaluates JavaScript in the browser to solve a challenge, and this challenge is per-domain, per-IP, per-session and per-user-agent, so I can't just solve the challenge once and add it to all my cookie jars. Additionally, there is no sane way of manually adding all of the required cookies to new cookie jars, so that's not an option either.
My most immediate thought was to try to solve the challenge by guessing patterns and capturing values with regular expressions; however, CloudFlare's engineers are several steps ahead and are actually obfuscating their challenge payload.
After trying a bit of trickery in Node.js to capture and evaluate the challenge, I realised that there's already a far simpler way around this: using a browser testing engine such as PhantomJS.
Ten minutes later, I had whipped together the following, which at the time of writing allows my scraping tools past CloudFlare with negligible failures.
/**
 * This is a project designed to get around sites using Cloudflare's "I'm under attack" mode.
 * Using the PhantomJS headless browser, it queries a site given to it as the second parameter,
 * waits six seconds and returns the cookies required to continue using this site. With this,
 * it is possible to automate scrapers or spiders that would otherwise be thwarted by Cloudflare's
 * anti-bot protection.
 *
 * To run this: phantomjs cloudflare-challenge.js http://www.example.org/
 *
 * Copyright © 2015 by Alex Wilson <[email protected]>
 *
 * Permission to use, copy, modify, and/or distribute this software for
 * any purpose with or without fee is hereby granted, provided that the
 * above copyright notice and this permission notice appear in all
 * copies.
 *
 * THE SOFTWARE IS PROVIDED "AS IS" AND ISC DISCLAIMS ALL WARRANTIES WITH
 * REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR ANY
 * SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
 * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
 * OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 */
/**
 * Namespaced object.
 * @type {Object}
 */
var antoligy = antoligy || {};

/**
 * Simple wrapper to retrieve Cloudflare's 'solved' cookie.
 * @type {Object}
 */
antoligy.cloudflareChallenge = {
    webpage: false,
    system: false,
    page: false,
    url: false,
    userAgent: false,
    timeout: false,

    /**
     * Initiate object.
     */
    init: function() {
        this.webpage = require('webpage');
        this.system = require('system');
        this.page = this.webpage.create();
        // args[0] is the script name, so args[1] is the URL given on the command line.
        this.url = this.system.args[1];
        this.userAgent = 'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0';
        // Milliseconds to wait for the challenge to complete before reading the cookies.
        this.timeout = 6000;
    },

    /**
     * "Solve" Cloudflare's challenge using PhantomJS's engine.
     * Prints our cookies to stdout as JSON, then exits.
     */
    solve: function() {
        var self = this;
        this.page.settings.userAgent = this.userAgent;
        this.page.open(this.url, function(status) {
            // Give the challenge script time to run and set its cookies.
            setTimeout(function() {
                console.log(JSON.stringify(phantom.cookies));
                phantom.exit();
            }, self.timeout);
        });
    }
};

/**
 * In order to carry on making requests, both user agent and IP address must match what was used here.
 */
antoligy.cloudflareChallenge.init();
antoligy.cloudflareChallenge.solve();
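
Once the script has printed its cookies, a scraper still has to replay them on every later request. As a rough sketch of how that might look (this is not part of the script above), the snippet below turns the JSON output into request headers. The helper names `domainMatches`, `buildCookieHeader` and `buildRequestHeaders` are hypothetical, and the sketch assumes each cookie object exposes `name`, `value` and `domain` fields, that the user agent is the same string the script used, and that requests leave from the same IP address.

```javascript
// Hypothetical helpers for illustration only; they are not part of the PhantomJS script.

// Assumption: this must be the same user agent string the challenge was solved with.
var USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0';

// Returns true when a cookie's domain applies to the given hostname.
// A leading dot on the cookie domain is ignored; subdomains of it also match.
function domainMatches(cookieDomain, hostname) {
  var bare = cookieDomain.replace(/^\./, '').toLowerCase();
  var host = hostname.toLowerCase();
  return host === bare || host.slice(-(bare.length + 1)) === '.' + bare;
}

// Builds the value of a Cookie header from the JSON printed by the script,
// keeping only the cookies whose domain applies to `hostname`.
function buildCookieHeader(cookiesJson, hostname) {
  return JSON.parse(cookiesJson)
    .filter(function (cookie) { return domainMatches(cookie.domain, hostname); })
    .map(function (cookie) { return cookie.name + '=' + cookie.value; })
    .join('; ');
}

// Bundles the user agent and cookies into one headers object for later requests.
function buildRequestHeaders(cookiesJson, hostname) {
  return {
    'User-Agent': USER_AGENT,
    'Cookie': buildCookieHeader(cookiesJson, hostname)
  };
}
```

The object returned by `buildRequestHeaders` could then be handed to whichever HTTP client the scraper uses, for example as the `headers` option of Node's `http.request`. When the cookies expire, the challenge has to be solved again.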
*[WAF]: Web Application Firewall
*[ESI]: Edge Side Includes