Category Archives: System

New Common Crawl data available

New crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages, and is 102TB in size (uncompressed). A huge corpus indeed.

The entire Common Crawl data set is stored on Amazon S3 as a Public Data Set.

Data Structure

New crawl data is located in the aws-publicdatasets bucket under the base path /common-crawl/crawl-data/.

Under this base path, crawl data is organized hierarchically as follows:

  • CRAWL-NAME-YYYY-WW – The name of the crawl plus the year and week number in which it was initiated
    • segments
      • SEGMENTNAME – A segment directory, typically a unix timestamp
        • warc – contains the WARC files with the HTTP request and responses for each fetch
          • CRAWL-NAME-YYYMMMDDSS-SEQ-MACHINE.warc.gz – individual WARC files
        • wat – contains WARC-encoded WAT files which describe the metadata of each request/response
          • CRAWL-NAME-YYYMMMDDSS-SEQ-MACHINE.warc.wat.gz – individual WAT files
        • wet – contains WARC-encoded WET files with text extractions from the HTTP responses
          • CRAWL-NAME-YYYMMMDDSS-SEQ-MACHINE.warc.wet.gz – individual WET files
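
The layout above maps directly onto S3 object keys. The following is a minimal sketch of assembling one such key. The helper name cc_key and every argument value are hypothetical placeholders, not real crawl, segment, or file names.

```shell
# Assemble the key for a single file from the hierarchy described above.
# cc_key is a hypothetical helper; its arguments are placeholders.
cc_key() {
    crawl=$1     # the crawl directory name
    segment=$2   # segment directory, typically a unix timestamp
    kind=$3      # one of: warc, wat, wet
    file=$4      # file name inside that directory
    printf 'common-crawl/crawl-data/%s/segments/%s/%s/%s\n' \
        "$crawl" "$segment" "$kind" "$file"
}

# Prefixed with the bucket name, this gives the full location.
echo "s3://aws-publicdatasets/$(cc_key CRAWL-NAME 1234567890 wet FILE.warc.wet.gz)"
```

Keeping the path logic in one function means the three file kinds differ only in a single argument.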

Reducing GC and Faster Memory Allocation To Improve JVM Performance

Netty is a high-performance NIO (New IO) client-server framework for Java that Twitter uses internally as a protocol-agnostic RPC system. Twitter found problems with Netty 3’s memory management for buffer allocations because it generated a lot of garbage during operation. When you send as many messages as Twitter does, this creates a lot of GC pressure, and the simple act of zero-filling newly allocated buffers consumed 50% of memory bandwidth.

Netty 4 fixes this situation with:

  • Instead of short-lived event objects, methods on long-lived channel objects are used to handle I/O events.
  • A specialized buffer allocator that uses a pool implementing buddy memory allocation and slab allocation.

The result:

  • 5 times less frequent GC pauses: 45.5 vs. 9.2 times/min
  • 5 times less garbage production: 207.11 vs 41.81 MiB/s
  • The buffer pool becomes much faster than JVM allocation as the size of the buffer increases, though there are some problems with smaller buffers.
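
As a quick check of the "5 times" figures, the ratios can be computed from the numbers quoted above. This is only arithmetic on those values, with awk used as a calculator; the helper name ratio is made up for the sketch.

```shell
# Ratio of the two figures in each bullet above.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 45.5 9.2       # GC pauses per minute, Netty 3 vs. Netty 4
ratio 207.11 41.81   # garbage production in MiB/s, Netty 3 vs. Netty 4
```

Both ratios come out just under 5, so "5 times" is a fair rounding.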

Given how many services use the JVM in their messaging infrastructure and how many services have GC-related performance problems, this is an impressive result that others may want to consider.

For more details on the improvements, please refer to this ppt.

Linux Shell tricks

Suspend the foreground process (run bg afterwards to continue it in the background):

Ctrl + z

Move process to foreground:

fg

Create an empty file:

touch a.file

Execute commands from a file in the current shell:

source /home/user/

Substring for first 5 characters:

${variable:0:5}

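In bash, the expansion ${variable:0:5} yields the first five characters of a variable. As a sketch for shells that lack that expansion, two alternatives that should work in plain sh are shown below; the sample string and variable names are made up.

```shell
s="Hello, world"

# A precision on %s truncates the string to at most 5 characters
# (some shells count bytes here, which is the same thing for ASCII).
first5_printf=$(printf '%.5s' "$s")

# cut selects character positions 1 through 5 of each input line.
first5_cut=$(printf '%s\n' "$s" | cut -c1-5)

echo "$first5_printf"
echo "$first5_cut"
```
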
SSH with pem key:

ssh user@ip_address -i key.pem

Get complete directory listing to local directory with wget:

wget -r --no-parent --reject "index.html*" http://hostname/ -P /home/user/dirs

Recursively create directories:

mkdir -p /home/user/dir1/dir2/dir3

Create multiple directories:

mkdir -p /home/user/{test,test1,test2}

List processes tree with child processes:

ps axwef

List the contents of a war or jar file:

jar -tf demo1.jar

Create war file:

jar -cvf name.war file

Test disk write speed:

dd if=/dev/zero of=/tmp/output.img bs=8k count=256k; rm -rf /tmp/output.img

Test disk read speed:

hdparm -Tt /dev/sda

Get md5 hash from text:

echo -n "text" | md5sum
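
The -n matters because echo otherwise appends a newline, and that newline becomes part of the hashed input. A small sketch of the difference, assuming md5sum from GNU coreutils is installed; printf '%s' is used here as a portable stand-in for echo -n.

```shell
# echo without -n appends a newline, which changes the hash.
with_newline=$(echo "text" | md5sum | cut -d' ' -f1)
without_newline=$(printf '%s' "text" | md5sum | cut -d' ' -f1)

echo "with newline:    $with_newline"
echo "without newline: $without_newline"
```

The cut stage drops the trailing file-name column that md5sum prints after the hash.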

Check xml syntax:

xmllint --noout file.xml

Extract tar.gz in new directory:

tar zxvf package.tar.gz -C new_dir
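
Note that tar's -C changes into an existing directory and does not create one. A minimal sketch of the whole round trip follows, using throwaway files under a temporary directory; demo_extract and all file names are made up for the demonstration.

```shell
demo_extract() {
    work=$(mktemp -d) || return 1
    mkdir "$work/src"
    echo "hello" > "$work/src/file.txt"

    # Create package.tar.gz containing src/file.txt.
    tar zcf "$work/package.tar.gz" -C "$work" src

    # -C requires the target directory to exist, so create it first.
    mkdir -p "$work/new_dir"
    tar zxf "$work/package.tar.gz" -C "$work/new_dir"

    # Show that the file arrived under new_dir.
    cat "$work/new_dir/src/file.txt"
    rm -rf "$work"
}

demo_extract
```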

Get HTTP headers with curl:

curl -I

Modify timestamp of some file or directory (YYMMDDhhmm):

touch -t 0712250000 file
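
To see how the ten digits are read, the sketch below sets the timestamp on a temporary file and prints it back. It assumes GNU date, whose -r option takes a file; demo_touch is a made-up helper name.

```shell
demo_touch() {
    f=$(mktemp) || return 1
    # 0712250000 = YY MM DD hh mm -> 25 December 2007, 00:00 local time.
    touch -t 0712250000 "$f"
    # date -r FILE (GNU date) prints the file's modification time.
    date -r "$f" '+%Y-%m-%d %H:%M'
    rm -f "$f"
}

demo_touch
```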

Generate random password (16 char long in this case):

LANG=c < /dev/urandom tr -dc _A-Z-a-z-0-9 | head -c${1:-16};echo;
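
The ${1:-16} in that one-liner only makes sense inside a function or script, where $1 is the requested length. A sketch of such a wrapper, with the hypothetical name random_password and the same character set:

```shell
# random_password [LENGTH]: hypothetical wrapper; default length is 16.
random_password() {
    # LC_ALL=C makes tr treat the input as raw bytes.
    # tr -dc deletes every byte that is NOT in the listed set.
    LC_ALL=C tr -dc 'A-Za-z0-9_-' < /dev/urandom | head -c "${1:-16}"
    echo
}

random_password
random_password 32
```

head stops reading once it has enough characters, which ends the otherwise endless stream from /dev/urandom.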

Quickly create a backup of a file:

cp some_file_name{,.bkp}

Passwordless SSH login on Ubuntu (copies your public key to the remote host):

ssh-copy-id not-marco@remote.hosts


Update date from Ubuntu NTP server:

ntpdate ntp.ubuntu.com

Show all listening TCP4 ports with netstat:

netstat -lnt4 | awk '{print $4}' | cut -f2 -d: | grep -o '[0-9]*'
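
To make the three filtering stages easier to follow, here is a sketch that runs them over canned input instead of live netstat output. The sample text is made up for illustration, and extract_ports is a hypothetical name.

```shell
# Made-up sample of `netstat -lnt4` output, for illustration only.
sample='Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN'

extract_ports() {
    awk '{print $4}' |      # 4th column: the local address (or header debris)
    cut -f2 -d: |           # keep what follows the colon, i.e. the port
    grep -o '[0-9]*'        # keep only runs of digits, dropping header words
}

printf '%s\n' "$sample" | extract_ports
```

The final grep is what discards the two header lines, since their fourth field contains no digits.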

Convert image from qcow2 to raw:

qemu-img convert -f qcow2 -O raw precise-server-cloudimg-amd64-disk1.img \

Run command repeatedly, displaying its output (default every two seconds):

watch ps -ef

List all users:

getent passwd

Mount root in read/write mode:

mount -o remount,rw /

Mount a directory (for cases when symlinking will not work):

mount --bind /source /destination

Send dynamic update to DNS server:

nsupdate <<EOF
update add $HOST 86400 A $IP
send
EOF

Recursively grep all directories:

grep -r "some_text" /path/to/dir

List ten largest open files:

lsof / | awk '{ if($7 > 1048576) print $7/1048576 "MB "$9 }' | sort -n -u | tail

Show free RAM in MB:

free -m | grep cache | awk '/[0-9]/{ print $4" MB" }'

Open Vim and jump to end of file:

vim + some_file_name

Git clone specific branch (master):

git clone -b master

Git switch to another branch (develop):

git checkout develop

Git delete branch (myfeature):

git branch -d myfeature

Git delete remote branch:

git push origin :branchName

Git push new branch to remote:

git push -u origin mynewfeature

Print out the last cat command from history:

!cat:p

Run your last cat command from history:

!cat

Find all empty subdirectories in /home/user:

find /home/user -maxdepth 1 -type d -empty
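
The three tests combine as follows: -maxdepth 1 stays at the top level, -type d keeps directories only, and -empty keeps those with no entries. A small sketch in a temporary directory, assuming a find that supports -maxdepth and -empty (GNU and BSD find do); the directory names are made up.

```shell
demo_empty_dirs() {
    base=$(mktemp -d) || return 1
    mkdir "$base/empty" "$base/full"
    touch "$base/full/file"

    # Only ./empty qualifies: ./full contains a file, and . itself is not empty.
    ( cd "$base" && find . -maxdepth 1 -type d -empty )
    rm -rf "$base"
}

demo_empty_dirs
```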

Get all from line 50 to 60 in test.txt:

< test.txt sed -n '50,60p'
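
Here -n suppresses sed's default printing and 50,60p prints only that range, both ends included. A quick sketch using generated input in place of test.txt (seq is assumed to be available, as it is on typical Linux systems):

```shell
# Generate 100 numbered lines and print lines 50 through 60 (inclusive).
seq 1 100 | sed -n '50,60p'
```

Because the range is inclusive, this prints eleven lines rather than ten.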

Re-run the last command with sudo (if it was: mkdir /root/test, the line below will run: sudo mkdir /root/test):

sudo !!

Create temporary RAM filesystem – ramdisk (first create /tmpram directory):

mount -t tmpfs tmpfs /tmpram -o size=512m

Grep whole words:

grep -w "name" test.txt
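
With -w, a match counts only when it is not surrounded by other word characters. A short sketch with made-up sample lines in place of test.txt:

```shell
# Three sample lines; only the first has "name" as a whole word.
printf '%s\n' 'name: test' 'username: test' 'rename me' | grep -w "name"
```

Without -w, all three lines would match, since each contains the letters "name".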

Append text to a file that requires raised privileges:

echo "some text" | sudo tee -a /path/file

List all supported kill signals:

kill -l

Generate random password (base64 of 16 random bytes, so 24 characters long):

openssl rand -base64 16

Do not log last session in bash history:

kill -9 $$

Scan network to find open port:

nmap -p 8081

Set git email:

git config --global user.email ""

To sync with master if you have unpublished commits:

git pull --rebase origin master

Move all files with “txt” in name to /home/user:

find -iname "*txt*" -exec mv -v {} /home/user \;

Put the file lines side by side:

paste test.txt test1.txt
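
paste joins line N of each file into one output line, separated by a tab by default. A sketch with two small made-up files created in a temporary directory:

```shell
demo_paste() {
    dir=$(mktemp -d) || return 1
    printf '%s\n' a1 a2 > "$dir/test.txt"
    printf '%s\n' b1 b2 > "$dir/test1.txt"
    # Corresponding lines are joined with a tab by default.
    paste "$dir/test.txt" "$dir/test1.txt"
    rm -rf "$dir"
}

demo_paste
```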

Progress bar in shell:

pv data.log

Send the data to Graphite server with netcat:

echo "hosts.sampleHost 10 `date +%s`" | nc 3000

Convert tabs to spaces:

expand test.txt > test1.txt
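
expand pads each tab out to the next tab stop rather than swapping it for a fixed number of spaces. A sketch on a single made-up line, reading from a pipe instead of test.txt:

```shell
# One tab between "a" and "b"; tab stops are every 8 columns by default.
printf 'a\tb\n' | expand

# -t sets a different tab width, here 4.
printf 'a\tb\n' | expand -t 4
```

In the first case "b" lands in column 9, in the second in column 5.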

Skip bash history (prefix the command with a space; HISTCONTROL must include ignorespace):

< space >cmd

Go to the previous working directory:

cd -

Split large tar.gz archive (100MB each) and put it back:

split -b 100m /path/to/large/archive /path/to/output/files
cat files* > archive
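
Reassembly works because split names the pieces with sorted suffixes (aa, ab, ...) and the shell expands the glob in that same order. A scaled-down sketch, using 250 KB of random data as a stand-in for a large archive; demo_split and the file names are made up.

```shell
demo_split() {
    dir=$(mktemp -d) || return 1
    # 250000 bytes of sample data standing in for a large archive.
    head -c 250000 /dev/urandom > "$dir/archive"

    # Split into pieces of at most 100 KiB: files_aa, files_ab, files_ac.
    split -b 100k "$dir/archive" "$dir/files_"

    # The glob expands in sorted order, so cat restores the original.
    cat "$dir"/files_* > "$dir/rebuilt"

    # cmp is silent and succeeds only when both files are byte-identical.
    cmp "$dir/archive" "$dir/rebuilt" && echo "identical"
    rm -rf "$dir"
}

demo_split
```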

Get HTTP status code with curl:

curl -sL -w "%{http_code}\n" -o /dev/null

Set root password and secure MySQL installation:

mysql_secure_installation

When Ctrl + c does not work:

Ctrl + \

Get file owner:

stat -c %U file.txt

List block devices:

lsblk -f

Find files with trailing spaces:

find . -type f -exec egrep -l " +$" {} \;
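
The pattern " +$" means one or more spaces immediately before the end of a line, and -l prints only the names of files that contain a match. A sketch with two made-up files, written with grep -E, the current spelling of egrep:

```shell
demo_trailing() {
    dir=$(mktemp -d) || return 1
    printf 'clean line\n' > "$dir/clean.txt"
    printf 'dirty line  \n' > "$dir/dirty.txt"
    # Only dirty.txt has spaces before the end of a line.
    ( cd "$dir" && find . -type f -exec grep -E -l ' +$' {} \; )
    rm -rf "$dir"
}

demo_trailing
```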

Find files with tab indentation:

find . -type f -exec egrep -l $'\t' {} \;

Print horizontal line with “=”:

printf '%100s\n' | tr ' ' =
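
The trick is that printf pads an empty string to the requested width with spaces, and tr then swaps every space for the desired character. A sketch that generalizes this into a function; hr is a hypothetical name, and the width and character arguments are additions for the example.

```shell
# hr [WIDTH] [CHAR]: hypothetical helper; defaults are 100 and "=".
hr() {
    width=${1:-100}
    char=${2:-=}
    # printf pads an empty string to WIDTH spaces; tr swaps each space for CHAR.
    printf "%${width}s\n" '' | tr ' ' "$char"
}

hr 20
hr 10 '#'
```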

UPDATE: November 2, 2013