Scan 10M websites for the X-Recruiting header using Go on an AWS Free Tier instance
What do you do when you’re looking for a new job? Right, you contact HR or search on websites like LinkedIn, Glassdoor, Stack Overflow, etc.
Let’s try another approach :) Have you heard about the ‘X-Recruiting’ header? For example, if you look at the response headers from PayPal.com, you can see this ‘strange’ header.
Interesting: how many companies use this smart way to find suitable candidates?
We will try to answer this question using Go and an AWS Free Tier instance. You can run the app on your own machine if you have a good, stable internet connection. For me personally, it didn’t work very well: my router froze after an hour and needed to be restarted (too many UDP requests).
Requirements and constraints:
- Scanning should be done using workers (we have a huge list of domains to scan)
- Random DNS servers should be used (otherwise a single DNS server would ban us, because we’ll be doing so many lookups)
- Low memory usage: we want the app to use a small amount of memory, so it fits on a Free Tier instance that has only 1 GB of RAM
First, I downloaded domain datasets from multiple sources:
- https://www.domcop.com/top-10-million-domains
- https://blog.majestic.com/development/majestic-million-csv-daily/
- http://s3-us-west-1.amazonaws.com/umbrella-static/index.html
wget https://www.domcop.com/files/top/top10milliondomains.csv.zip
unzip top10milliondomains.csv.zip
wget http://downloads.majestic.com/majestic_million.csv
wget http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
unzip top-1m.csv.zip
Then I needed to clean, merge, filter, and save the unique domains to a separate file that will be used for scanning.
awk -F "\"*,\"*" '{print $2}' top-1m.csv > umbrella-1m-domains.txt
awk -F "\"*,\"*" 'NR>1 {print $3}' majestic_million.csv > majestic-1m-domains.txt
awk -F "\"*,\"*" 'NR>1 {print $2}' top10milliondomains.csv > domcop-10m-domains.txt
cat domcop-10m-domains.txt majestic-1m-domains.txt umbrella-1m-domains.txt | sort -u > uniq-domains.txt
Second, I needed to prepare a list of DNS servers to be used for the domain IP lookups.
wget https://public-dns.info/nameservers.csv
awk -F "\"*,\"*" 'NR>1 && NF>0 {print $1}' nameservers.csv > dns-servers.txt
Finally, I have two files that will be used by the app:
dns-servers.txt
uniq-domains.txt
About the code.
DNS lookup is implemented using the popular github.com/miekg/dns package. An interface is defined so that the IP-lookup implementation can be swapped if needed. The Resolver loads the file’s contents into memory and then uses a random server for each lookup. The Resolve method returns the list of IPs, if any were found.
The app is implemented using Go’s version of the Producer-Consumer pattern with channels. The Worker is our consumer: it receives the jobs (domains) to be scanned from the JobQueue channel and exits when the channel is closed. (The producer is responsible for closing the channel when no domains remain to be scanned.)
Because most of our work is I/O, we can create thousands of goroutines (workers) and benefit from them even on a 1-CPU machine.
We also use a WaitGroup to wait until all workers have finished their jobs.
We also use a buffered channel when reading domains from the file. This allows the app to use a small amount of memory. We could just load the whole file into memory (and that’s what I did on my first try), but then we would need a server with more memory, or a swap file, if we still want to use an EC2 Free Tier instance.
The App object acts as the producer. It is responsible for creating the workers, synchronizing them, and receiving completed jobs back from the workers.
You can see the full code at https://github.com/spaiz/hrscanner
Build and run the app locally
To run it locally, you need to clone the project, install its dependencies, unzip the data files, install the app, and run it.
# clone the project
mkdir -p ${GOPATH}/src/github.com/spaiz/
cd ${GOPATH}/src/github.com/spaiz/
git clone git@github.com:spaiz/hrscanner.git
cd hrscanner

# install the dependency manager used in the project
go get -u github.com/kardianos/govendor

# install the project's dependencies
govendor sync

# install the app
go install

# unzip the domains and DNS servers files
cd ${GOPATH}/src/github.com/spaiz/hrscanner/data/
unzip dns-servers.txt.zip && rm dns-servers.txt.zip
unzip uniq-domains.txt.zip && rm uniq-domains.txt.zip
cd ${GOPATH}/src/github.com/spaiz/hrscanner/

# run the app with default settings
hrscanner
Run the app on the server
I use a MacBook, so to build a binary that runs on Linux, I use static compilation inside Docker. I prepared a tiny script for this.
./bin/build.sh
It will create the binary file inside the artifacts directory. Now just upload the binary and the data files to the server, and run it. I use the scp tool for this (you need to set up SSH access to your server using keys).
scp -r -C -i ~/.ssh/mykey.pem ${GOPATH}/src/github.com/spaiz/hrscanner/data/ ec2-user@remote_host:~/data/
scp -i ~/.ssh/mykey.pem ${GOPATH}/src/github.com/spaiz/hrscanner/artifacts/hrscanner ec2-user@remote_host:~/
Before running the app on the server, we should increase the max open files limit, otherwise we will get the well-known error:
too many open files
There are multiple ways to achieve this. I did it using a method that persists the settings even after a server restart. All actions were performed on a Free Tier Amazon EC2 instance created from an AMI.
Create new file:
sudo touch /etc/security/limits.d/custom.conf
And put:
* soft nofile 1000000
* hard nofile 1000000
Then edit /etc/sysctl.conf
sudo nano /etc/sysctl.conf
And add to the end of the file:
fs.file-max = 1000000
fs.nr_open = 1000000
net.ipv4.netfilter.ip_conntrack_max = 1048576
net.nf_conntrack_max = 1048576
You must then reconnect to the server. To check that the new settings have been applied, run:
ulimit -n
I use tmux to run the app on the server. It lets me close the terminal while the app continues running. Just install it on the server and start it.
sudo apt install tmux
tmux

# run the app with a custom number of workers
./hrscanner -workers=500 > logs.txt &

# exit the session
tmux detach # now you can close the terminal
The next time you connect to the server, you can reopen the previous session by typing
tmux attach
All domains with the X-Recruiting header will be saved to the results.txt file (the filename can be changed via flags).
In my case, the app started at ~700 RPS and after some time settled at a stable ~250 RPS (~900,000 requests per hour).
And the results.txt file will look like this:
Source code:
https://github.com/spaiz/hrscanner
P.S.
The solution isn’t ideal. There are no retries… there is no guarantee that all the DNS servers work well… I don’t try the other DNS A records if the first HTTP request fails… but still, it’s good enough for me :)
Tip.
You can create your own domains list, for example by scraping angel.co or crunchbase.com and selecting only relevant hi-tech companies ;)
P.S.2
The app is still running. I’ll update the results when it finishes scanning all 10M websites :)
Update.
After more than 24 hours of running, the app found 1,873 domains with the X-Recruiting header. You can see the report here.