Scan 10M websites for the X-Recruiting header using Go on an AWS Free Tier instance

Requirements and constraints:

wget https://www.domcop.com/files/top/top10milliondomains.csv.zip
unzip top10milliondomains.csv.zip
wget http://downloads.majestic.com/majestic_million.csv
wget http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
unzip top-1m.csv.zip
awk -F "\"*,\"*" '{print $2}' top-1m.csv > umbrella-1m-domains.txt
awk -F "\"*,\"*" 'NR>1 {print $3}' majestic_million.csv > majestic-1m-domains.txt
awk -F "\"*,\"*" 'NR>1 {print $2}' top10milliondomains.csv > domcop-10m-domains.txt
# merge the three lists and keep one copy of every domain
cat domcop-10m-domains.txt majestic-1m-domains.txt umbrella-1m-domains.txt | sort -u > uniq-domains.txt
wget https://public-dns.info/nameservers.csv
awk -F "\"*,\"*" 'NR>1 && NF>0 {print $1}' nameservers.csv > dns-servers.txt
The commands above produce two input files for the scanner: dns-servers.txt and uniq-domains.txt.
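To make the role of the two files concrete, here is a minimal Go sketch of one way a scanner might use them: read a one-entry-per-line file and send DNS queries to a randomly chosen server from the list. How hrscanner itself loads these files is not shown here, so `parseLines` and `newResolver` are hypothetical helpers, and port 53 plus the 2-second dial timeout are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net"
	"strings"
	"time"
)

// parseLines splits text into trimmed, non-empty lines and drops repeats,
// keeping the order in which entries were first seen.
func parseLines(text string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if _, dup := seen[line]; dup {
			continue
		}
		seen[line] = struct{}{}
		out = append(out, line)
	}
	return out
}

// newResolver returns a resolver that dials a randomly chosen server for
// every query. The servers slice must not be empty. Port 53 is assumed.
func newResolver(servers []string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // use Go's own resolver so the custom Dial is honoured
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			server := servers[rand.Intn(len(servers))]
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, net.JoinHostPort(server, "53"))
		},
	}
}

func main() {
	// In a real run the text would come from os.ReadFile("dns-servers.txt").
	servers := parseLines("8.8.8.8\n\n1.1.1.1\n8.8.8.8\n")
	fmt.Println("servers:", servers)

	resolver := newResolver(servers)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Needs network access; without it the error is simply printed.
	addrs, err := resolver.LookupHost(ctx, "example.com")
	fmt.Println(addrs, err)
}
```

Spreading lookups over many public servers is presumably why the list is downloaded in the first place, since a single resolver would likely throttle millions of queries.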

About the code.

Build and run the app locally

# clone the project
mkdir -p ${GOPATH}/src/github.com/spaiz/
cd ${GOPATH}/src/github.com/spaiz/
git clone git@github.com:spaiz/hrscanner.git
cd hrscanner
# install the dependency manager used in the project
go get -u github.com/kardianos/govendor
# install project's dependencies
govendor sync
# install the app
go install
# unzip domains and DNS servers files
cd ${GOPATH}/src/github.com/spaiz/hrscanner/data/
unzip dns-servers.txt.zip && rm dns-servers.txt.zip
unzip uniq-domains.txt.zip && rm uniq-domains.txt.zip
cd ${GOPATH}/src/github.com/spaiz/hrscanner/
# run the app with default settings
hrscanner
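For readers who want to see what a single check looks like, here is a minimal sketch of requesting one site and reading its X-Recruiting header. It is not taken from the hrscanner source: `recruitingHeader` and `checkSite` are hypothetical names, the use of a HEAD request and the 5-second timeout are assumptions, and the demo talks to a local test server with a made-up header value instead of a real website.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// recruitingHeader returns the X-Recruiting value and whether it was present.
func recruitingHeader(h http.Header) (string, bool) {
	v := h.Get("X-Recruiting")
	return v, v != ""
}

// checkSite sends a HEAD request to url and reports the X-Recruiting header.
// HEAD is an assumption here: it avoids downloading the response body.
func checkSite(ctx context.Context, client *http.Client, url string) (string, bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodHead, url, nil)
	if err != nil {
		return "", false, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", false, err
	}
	defer resp.Body.Close()

	v, ok := recruitingHeader(resp.Header)
	return v, ok, nil
}

func main() {
	// Local stand-in for a website that advertises jobs in its headers.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Recruiting", "made-up value: we are hiring")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	value, found, err := checkSite(context.Background(), client, srv.URL)
	fmt.Println(value, found, err)
}
```

Some servers answer HEAD differently from GET, so a real scan might need to fall back to GET when HEAD fails.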

Run the app on the server

./bin/build.sh
scp -r -C -i ~/.ssh/mykey.pem ${GOPATH}/src/github.com/spaiz/hrscanner/data/ ec2-user@remote_host:~/data/
scp -i ~/.ssh/mykey.pem ${GOPATH}/src/github.com/spaiz/hrscanner/artifacts/hrscanner ec2-user@remote_host:~/

too many open files

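# create a limits file and put the two "nofile" lines below into it
sudo touch /etc/security/limits.d/custom.conf
* soft nofile 1000000
* hard nofile 1000000
# add the four settings below to /etc/sysctl.conf
sudo nano /etc/sysctl.conf
fs.file-max = 1000000
fs.nr_open = 1000000
net.ipv4.netfilter.ip_conntrack_max = 1048576
net.nf_conntrack_max = 1048576
# check the open files limit from a new login session
ulimit -n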
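The same limit can also be read from inside a Go program, which is handy for a startup check before opening hundreds of sockets. The sketch below is an illustration only and assumes a Linux or other Unix-like system; `openFileLimit` is a hypothetical helper, and the worker count of 500 simply mirrors the value used when running the app on the server.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit returns the soft and hard RLIMIT_NOFILE values of the
// current process (Unix-like systems only).
func openFileLimit() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	return uint64(rl.Cur), uint64(rl.Max), nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("open files: soft=%d hard=%d\n", soft, hard)

	// Each in-flight request needs at least one descriptor, so warn when
	// the planned worker count does not fit under the soft limit.
	const workers = 500
	if uint64(workers) > soft {
		fmt.Println("warning: worker count exceeds the soft open files limit")
	}
}
```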
sudo apt install tmux
tmux
# run the app with a custom number of workers
./hrscanner -workers=500 > logs.txt &
# detach from the session
tmux detach
# now you can close the terminal; to return to the session later, run
tmux attach
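The `-workers` flag suggests a fixed pool of goroutines pulling domains from a shared queue. The following is a minimal sketch of that pattern, not the actual hrscanner code: `scan` and the `check` callback are hypothetical, and the demo uses a fake check function and made-up domains so that it runs without any network access.

```go
package main

import (
	"flag"
	"fmt"
	"sync"
)

// scan distributes domains over a fixed number of workers and returns the
// domains for which check reported a hit, mapped to the value it returned.
func scan(domains []string, workers int, check func(string) (string, bool)) map[string]string {
	if workers < 1 {
		workers = 1 // at least one worker, otherwise the queue never drains
	}
	jobs := make(chan string)
	hits := make(map[string]string)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for domain := range jobs {
				if value, ok := check(domain); ok {
					mu.Lock() // the map is shared between workers
					hits[domain] = value
					mu.Unlock()
				}
			}
		}()
	}

	for _, d := range domains {
		jobs <- d
	}
	close(jobs) // lets the workers leave their range loops
	wg.Wait()
	return hits
}

func main() {
	workers := flag.Int("workers", 4, "number of concurrent workers")
	flag.Parse()

	// Fake check standing in for a real HTTP request.
	fake := func(domain string) (string, bool) {
		return "join us", domain == "hiring.example"
	}
	hits := scan([]string{"a.example", "hiring.example", "b.example"}, *workers, fake)
	fmt.Println(hits)
}
```

With an unbuffered channel the producer only moves as fast as the workers consume, so memory stays flat no matter how long the domain list is.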
