
We all have that guy who tweets too much about a certain topic we dislike. Or all those hardcore seasonal fans of the current NFL sensation. Or those annoying Spotify/Rdio links you’re never going to click!

I’ve been waiting for a while for mainstream Twitter clients to allow keyword- or pattern-based filtering, and when I realized people were starting to tweet about the 49ers the night before the game, I decided to finally address it myself.

The technique I’m going to describe consists of setting up a proxy to the Twitter API that lives on our computer, powered by Node.JS, and that filters responses based on arbitrary rules we set up. Not only does this give us the flexibility of using JavaScript, but it also means it will work transparently with any Twitter client that leverages the API. Yes, this includes the Twitter website!

The premises

Before we begin, we must fully understand how the majority of Twitter desktop (and mobile) clients work.

  1. The app first needs to authenticate the user. Two methods can be used:

    OAuth: relies on the user going to a web page that grants access.

    XAuth: asks for the user’s username and password directly. This is the most practical method of authentication for a custom client, and it’s the technique that Twitter for Mac uses.

  2. Once this authorization is successful, the Twitter client will make HTTP requests to the Twitter API (located at api.twitter.com). In order to make these requests on behalf of the user, it must first sign them. Signing a request means adding an Authorization header created with the token obtained in step 1.
  3. Just like any other application that leverages the API, the Twitter client will then exchange content in some format (XML or JSON).
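
For example, a signed request from step 2 carries an Authorization header shaped roughly like this (the values are illustrative ones in the format Twitter’s documentation describes, not from a real session):

Authorization: OAuth oauth_consumer_key="xvz1evFS4wEEPTGEFPHBog",
  oauth_nonce="kYjzVBB8Y0ZFabxSWbWovY3uYSQ2pTgmZeNu2VS4cg",
  oauth_signature="tnnArxj06cWHq44gCs1OSKk%2FjLY%3D",
  oauth_signature_method="HMAC-SHA1",
  oauth_timestamp="1318622958",
  oauth_token="370773112-GmHxMAgYyLbNEtIKZeRNFsMKPR9EyMZeS9weJAEb",
  oauth_version="1.0"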

Our plan of action

  1. We’ll leverage the system’s hostname resolution (/etc/hosts) to override api.twitter.com and make it point to a Node.JS server running locally.
  2. Since the Twitter client will attempt to connect to our overridden IP over port 443, we’ll set up a “second” loopback interface, 127.0.0.2. This will prevent our Twitter hack from getting in the way of other projects and apps that serve traffic on port 443.
  3. To bypass SSL security checks in the client, we’ll set up a new root certificate for our computer that validates 127.0.0.2’s claim to be api.twitter.com.
  4. We’ll forward all requests to Twitter, but we’ll capture the response of the timeline request to run our filters.
  5. We’ll expose a middleware API for filters to be set up (sketched below).
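
To make step 5 concrete, here’s a hypothetical sketch of what a user-defined filter could look like. The (tweet, next) signature is an assumption for illustration, not the actual API:

// hypothetical filter: drop any tweet that mentions the 49ers
module.exports = function (tweet, next) {
  if (/49ers/i.test(tweet.text)) {
    return next(null, false); // false: drop this tweet
  }
  next(null, true); // true: let it through
};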

Forwarding api.twitter.com

Redirecting api.twitter.com to 127.0.0.2 is as straightforward as editing /etc/hosts:

∞ ~ sudo sh -c "echo '127.0.0.2 api.twitter.com' >> /etc/hosts"
∞ ~ tail -n 1 /etc/hosts
127.0.0.2 api.twitter.com
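
To double-check that the override took effect, we can ask Node itself. dns.lookup goes through the system resolver, so it honors /etc/hosts:

require('dns').lookup('api.twitter.com', function (err, address) {
  if (err) throw err;
  console.log(address); // 127.0.0.2
});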

There’s no place like 127.0.0.2

Our plan is to capture all the HTTPS traffic going to api.twitter.com over port 443 by forwarding it to 127.0.0.2. To accomplish this, we must first make that IP an alias of our loopback interface.

On OS X, we can accomplish this by aliasing it to lo0:

∞ ~ sudo ifconfig lo0 alias 127.0.0.2

As a result, we can now bind to the same port on distinct IP addresses. As a test, you can run the following two Node programs (each with sudo, since they bind to the privileged port 80):

// first program: responds on the default loopback address
require('http').createServer(function (req, res) {
  res.writeHead(200);
  res.end('welcome to 127.0.0.1:80');
}).listen(80, '127.0.0.1');

// second program: responds on the aliased loopback address
require('http').createServer(function (req, res) {
  res.writeHead(200);
  res.end('welcome to 127.0.0.2:80');
}).listen(80, '127.0.0.2');

Pointing your browser to http://127.0.0.1 and http://127.0.0.2 will yield different responses.

Port 443 or port 3000?

In the previous Node server examples, binding to port 80 forced us to use sudo, since it’s considered a privileged port. We can avoid needing sudo every time we run our server by forwarding the traffic destined to port 443 on 127.0.0.2 to the unprivileged port 3000:

∞ ~ sudo ipfw add fwd 127.0.0.2,3000 tcp from any to 127.0.0.2 443 in

As a result, we can now run code like this (notice we call listen() with port 3000):

// helper that reads a file relative to this script
function read (file) {
  return require('fs').readFileSync(__dirname + '/' + file, 'utf8');
}

require('https').createServer({
    key: read('key.key')
  , cert: read('cert.crt')
}, function (req, res) {
  res.writeHead(200);
  res.end('welcome to ' + req.headers.host + ' bound to 127.0.0.2:3000');
}).listen(3000, '127.0.0.2');

But we can still point our browser to https://api.twitter.com, just like the Twitter client does.

For this example, I used self-signed certificates, which will trigger a security warning. You can download this example from here.
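
If you prefer to test from Node rather than the browser, a minimal client sketch follows. rejectUnauthorized: false turns off certificate verification (assuming a Node version where it’s on by default), so this is only appropriate for local experiments like this one:

// request our local server through the forwarded port,
// tolerating the self-signed certificate
require('https').get({
  host: '127.0.0.2',
  port: 443, // ipfw forwards this to 3000
  path: '/',
  headers: { host: 'api.twitter.com' },
  rejectUnauthorized: false // local testing only!
}, function (res) {
  res.on('data', function (chunk) {
    process.stdout.write(chunk);
  });
});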

Hacking SSL

At this point, not only will our Twitter client be unable to make any requests (since we’re not proxying them yet), but it will refuse to establish any connection at all, due to the SSL certificate checks:

The client is onto us! Stop SOPA!

Thankfully, it’s fairly easy for us to create a new root certificate authority (our own little VeriSign), then generate a certificate request for api.twitter.com and sign it with that root. If we then install the root certificate at the system level, the Twitter client and browsers will recognize the certificates it signs as legitimate.

We start by creating our root. For this, we can leverage a Perl script that ships with OpenSSL. We first create a ~/root directory where all the generated files will be kept:

 ∞ ~ mkdir ~/root
 ∞ ~ cd ~/root/
 ∞ root perl /System/Library/OpenSSL/misc/CA.pl -newca

This will ask a series of questions, most of which you can answer with the default value. Set a passphrase when asked. When it asks for a Common Name, use something easily identifiable like “Guillermo”, in case you want to remove the root certificate from your system in the future.

The result will be the creation of a directory called demoCA, which contains the root certificate database.

Now that we’ve rolled our own VeriSign, we can do what we normally do when getting real certificates for our websites: generate a key, generate a signing request, and sign it, except this time we sign it ourselves:

 ∞ root openssl genrsa 1024 > api.twitter.com.key
Generating RSA private key, 1024 bit long modulus
....++++++
.......++++++
e is 65537 (0x10001)
 ∞ root openssl req -new -key api.twitter.com.key -out newreq.pem

The second command will again ask a series of questions. The only one that matters is the Common Name: set it to api.twitter.com.

Our friend the Perl script can also aid in the process of signing a request file (in this case newreq.pem):

 ∞ root perl /System/Library/OpenSSL/misc/CA.pl -sign

The signing process is very difficult and gruesome, and that’s why certificates are so expensive: type in your passphrase, write y, and press enter.

The signing process will yield a file newcert.pem, which we rename to api.twitter.com.crt to identify it more easily.
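
In other words:

 ∞ root mv newcert.pem api.twitter.com.crt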

We now have the .key and .crt files that we’ll give to our Node web server. But before we let the Twitter client attempt a connection, we install the root certificate (cacert.pem inside ~/root/demoCA) by double-clicking it.
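
To sanity-check the chain before involving the Twitter client, we can also verify it from Node. This is a sketch: it assumes an HTTPS server like the one in the next section is already listening on 127.0.0.2:3000 with the new certificate, and that the root certificate lives at ~/root/demoCA/cacert.pem:

var tls = require('tls');
var fs = require('fs');

var socket = tls.connect({
  host: '127.0.0.2',
  port: 3000,
  servername: 'api.twitter.com', // the name the certificate must match
  ca: [fs.readFileSync(process.env.HOME + '/root/demoCA/cacert.pem')],
  rejectUnauthorized: false // report the result instead of aborting
}, function () {
  // true only if the certificate chains up to our root
  console.log('authorized:', socket.authorized);
  socket.end();
});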

Proxying

Let’s try setting up our newly signed certificate and logging the incoming requests:

require('https').createServer({
    // the key and certificate we just generated
    key: read('api.twitter.com.key')
  , cert: read('api.twitter.com.crt')
}, function (req, res) {
  // log the method and url (with some ANSI colors)
  console.error(' - \033[96m%s\033[90m %s\033[39m'
    , req.method, req.url);
  res.writeHead(500);
  res.end();
}).listen(3000, '127.0.0.2');

If we run our server and relaunch Twitter for Mac, we should see the requests flowing in.

Our own Twitter client now trusts us! And since we’re replying with status code 500, the client reflects the error.

Our proxying strategy is as follows:

  1. We capture all requests.
  2. We create new requests (using superagent) that we send to Twitter via its IP address (necessary because api.twitter.com now resolves to our own computer).
  3. To avoid dealing with gzip, we re-write the responses to not include Content-Encoding. Ideally we’d also re-write the request to exclude Accept-Encoding, but that would invalidate the OAuth signature.
  4. When the response comes back, we send it along untouched unless it targets the home_timeline API endpoint.
  5. For timeline requests, we parse the XML, filter or change what we want, re-encode it, and send it back. Since we’re changing the response body, some headers need to be altered, like Content-Length.

Re-writing

To illustrate the rewriting process, consider this example in which we replace the text of every tweet with “haha”.

The first thing we want to do is detect the requests for the home_timeline.xml endpoint:

// detect timeline xml request
var filter = /home_timeline\.xml/.test(req.url);

Then we buffer the body of the incoming request:

var data = '';
req.on('data', function (chunk) { data += chunk; });
req.on('end', function () {
  // …
});

We finally create our new request and parse the XML with the excellent libxmljs by Marco Rogers:

// at the top of our proxy script:
var request = require('superagent');
var libxml = require('libxmljs');

// re-issue the request against Twitter's real IP, since
// api.twitter.com now resolves to our own machine
// (199.59.149.232 was Twitter's IP at the time of writing)
var proxy = request[req.method.toLowerCase()]
  ('https://199.59.149.232' + req.url);

if (data) proxy.send(data);

proxy.end(function (apiRes) {
  // we couldn't strip Accept-Encoding from the request (point 3),
  // so strip Content-Encoding from the response instead
  delete apiRes.headers['content-encoding'];

  // parse and rewrite the xml for timeline responses
  if (200 == apiRes.status && filter) {
    var doc = libxml.parseXmlString(apiRes.text);
    doc.find('//text').forEach(function (node) {
      node.text('haha');
    });
    respond(doc.toString());
  } else {
    respond(apiRes.text);
  }

  function respond (text) {
    // the body changed, so the content-length must be recomputed
    apiRes.headers['content-length'] = Buffer.byteLength(text);
    res.writeHead(apiRes.status, apiRes.headers);
    res.end(text);
  }
});

This is a slightly over-simplified example. In reality, for the OAuth signature to stay valid we need to proxy the headers exactly as we receive them. Node.JS unfortunately lowercases all header names and drops the original strings. I perform the following replacement to try to restore them to their conventional form (e.g. x-twitter-client-version becomes X-Twitter-Client-Version):

for (var i in req.headers) {
  // uppercase the first letter and every letter that follows a hyphen
  var h = i.replace(/^[a-z]|-[a-z]/g, function (a) {
    return a.toUpperCase();
  });
  proxy.set(h, req.headers[i]);
}

In part II, I’ll describe the creation of the API that lives on top of the proxy to ease the tweet re-writing and filtering process, exposed to the end user through an NPM module.

6 Comments

Jeremy Dunck said

This is weirdly exactly the same idea I had for filtering Twitter some time ago. See also: http://proxlet.com/ (they offer a Chrome extension, but that’s just for market penetration – the actual service is an API-compatible proxy).

As for the API to change filter settings, I was imagining having a special Twitter user – you’d send it a mention + some DSL, and the proxy would process those specially, not sending them on to Twitter.

@twilter help
-> commands: foo bar baz
@twilter remove /tebow/i
-> Now removing all tweets which match the regex /tebow/i
@twilter list
-> removing all tweets which match the regex /tebow/i
-> muting all tweets which match /football/i for 3 more hours
-…

Francis THOMAS said

Very interesting project.

As far as I know, there is one Twitter client allowing advanced filtering (iTunes smart-playlist style, see picture: http://www.lecentre.net/fratoblog/wp-content/uploads/2009/12/socialite-smart-folder.jpg): Socialite, formerly created by Realmac Software, now sold to Apparent Software (http://www.apparentsoft.com/socialite). But it is quite buggy and turns out to be very slow. They are working on a v2; we’ll see if it gets better.

Nevertheless, I’m very interested in your project.

Wayne Bloore said

Interesting idea. We are currently working on a full Twitter client that does filtering – Tweetdig.com. It is browser-based only at the moment, but mobile and desktop apps will be coming in the next month.

Ted Leung said

Are you still going to write a part 2?

Bruno Tagliani said

Nice!

Bruno Tagliani said

Can you write part 2 please?


About Guillermo Rauch:

CTO and co-founder of LearnBoost / Cloudup (acquired by Automattic in 2013). Argentine living in SF.