Scan 10M websites for the X-Recruiting header using Go on an AWS Free Tier instance

Requirements and constraints:

  1. Scanning should be done using workers (we have a huge list of domains to scan).
  2. Random DNS servers should be used (otherwise a single DNS server will ban us, because we’ll do so many DNS lookups).
  3. Memory usage: we want the app to use a small amount of memory, so it can run on a Free Tier instance that has only 1 GB of RAM.

# download and unpack the domain lists
wget https://www.domcop.com/files/top/top10milliondomains.csv.zip
unzip top10milliondomains.csv.zip
wget http://downloads.majestic.com/majestic_million.csv
wget http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
unzip top-1m.csv.zip
# extract the domain column from each list
awk -F "\"*,\"*" '{print $2}' top-1m.csv > umbrella-1m-domains.txt
awk -F "\"*,\"*" 'NR>1 {print $3}' majestic_million.csv > majestic-1m-domains.txt
awk -F "\"*,\"*" 'NR>1 {print $2}' top10milliondomains.csv > domcop-10m-domains.txt
# merge the lists and keep one copy of each domain
cat domcop-10m-domains.txt majestic-1m-domains.txt umbrella-1m-domains.txt | sort -u > uniq-domains.txt
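# download the list of public DNS servers and extract the first column, skipping the header and empty lines
wget https://public-dns.info/nameservers.csv
awk -F "\"*,\"*" 'NR>1 && NF>0 {print $1}' nameservers.csv > dns-servers.txt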

The result is two files: dns-servers.txt and uniq-domains.txt.

About the code.

DNS lookup is implemented using the popular http://github.com/miekg/dns package. An interface is defined so that the IP lookup implementation can be changed if needed. The resolver loads the DNS servers file into memory and then uses a random server for each lookup. The Resolve method returns the list of IPs found, if any.

Build and run the app locally

To run it locally, clone the project, install the dependencies, install the app, unzip the data files, and run it.

# clone the project
mkdir -p ${GOPATH}/src/github.com/spaiz/
cd ${GOPATH}/src/github.com/spaiz/
git clone git@github.com:spaiz/hrscanner.git
cd hrscanner
# install the dependency manager used by the project
go get -u github.com/kardianos/govendor
# install project's dependencies
govendor sync
# install the app
go install
# unzip domains and DNS servers files
cd ${GOPATH}/src/github.com/spaiz/hrscanner/data/
unzip dns-servers.txt.zip && rm dns-servers.txt.zip
unzip uniq-domains.txt.zip && rm uniq-domains.txt.zip
cd ${GOPATH}/src/github.com/spaiz/hrscanner/
# run the app with default settings
hrscanner

Run the app on the server

I use a MacBook, so to build a binary that runs on Linux, I compile it statically inside Docker. I prepared a tiny script for this.

./bin/build.sh
scp -r -C -i ~/.ssh/mykey.pem ${GOPATH}/src/github.com/spaiz/hrscanner/data/ ec2-user@remote_host:~/data/
scp -i ~/.ssh/mykey.pem ${GOPATH}/src/github.com/spaiz/hrscanner/artifacts/hrscanner ec2-user@remote_host:~/
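
On the server, raise the open files limit. Create a limits file:

sudo touch /etc/security/limits.d/custom.conf

and add these lines to it:

* soft nofile 1000000
* hard nofile 1000000

Then edit the kernel settings:

sudo nano /etc/sysctl.conf

and add these lines:

fs.file-max = 1000000
fs.nr_open = 1000000
net.ipv4.netfilter.ip_conntrack_max = 1048576
net.nf_conntrack_max = 1048576

Log in again and check the limit:

ulimit -n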
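
The same check can also be done from inside the app at startup. The sketch below is only an illustration under my own assumptions: openFilesLimit and enoughDescriptors are hypothetical helpers, and the budget of 4 descriptors per worker is a guess, not a measured value.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFilesLimit returns the soft and hard limits on open file descriptors
// for the current process (the values behind `ulimit -n` and `ulimit -Hn`).
func openFilesLimit() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	return uint64(rl.Cur), uint64(rl.Max), nil
}

// enoughDescriptors reports whether the soft limit leaves room for the
// given number of workers. perWorker is an assumed budget, not a measured one.
func enoughDescriptors(soft uint64, workers, perWorker int) bool {
	return soft >= uint64(workers*perWorker)
}

func main() {
	soft, hard, err := openFilesLimit()
	if err != nil {
		fmt.Println("cannot read limit:", err)
		return
	}
	fmt.Println("soft:", soft, "hard:", hard)
	if !enoughDescriptors(soft, 500, 4) {
		fmt.Println("open files limit looks too low for 500 workers")
	}
}
```

This works on Linux and macOS, where syscall.Getrlimit is available. Failing fast with a clear message is probably nicer than discovering "too many open files" errors in the logs an hour into the scan.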
sudo apt install tmux
tmux
# run the app with custom workers number
./hrscanner -workers=500 > logs.txt &
# detach from the session
tmux detach
# now you can close the terminal; to return to the session later, run
tmux attach

Alexander Ravikovich


In GO we trust. Software Engineer. @Isreal