RLOS-2021-Microsoft

Contains updates on my work on parallel parsing improvements for Vowpal Wabbit.

How to run:

This page explains how to build the project and how to create and run its tests and benchmarks.

Code:

Work on parallel parsing was started last year by Cassandra, and my work this year builds on it. The multithread_parser_with_passes branch contains all of last year's work, updated against the master branch (only to the point where it compiles). I have created my pull requests against that branch for easy comparison.

Currently, there are three active branches in nishantkr18/vowpal_wabbit:

Inactive branches (not updated):

The rest of the branches contain implementations of ideas that were eventually dropped.

Building:

Initially I used Ubuntu 20.04 (a Debian-based distribution) for the project. It is easy to work with, since all the dependencies and build instructions are provided on the Vowpal Wabbit GitHub wiki.

Later, I switched to Arch Linux, where I only had to install the additional CMake and Boost packages (a simple pacman -S is enough).

Tip: To suppress the warnings printed every time you build VW, use: cmake ../ -DWARNINGS=OFF.

Benchmark dataset:

All benchmarks are carried out on a repeated sample of the dataset 0001.dat, which is available in the repository's tests. We have used two variants of the dataset:

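As a rough illustration of how a repeated dataset can be produced, the sketch below concatenates a file with itself a given number of times. The helper name repeat_file and the copy count are assumptions made for this example; they are not necessarily how the benchmark files in this project were generated.

```shell
# Hypothetical helper: write <copies> back-to-back copies of <input> to <output>.
# Usage: repeat_file <input> <copies> <output>
repeat_file() {
    input="$1"
    copies="$2"
    output="$3"
    : > "$output"                     # create or truncate the output file
    i=0
    while [ "$i" -lt "$copies" ]; do
        cat "$input" >> "$output"     # append one more copy
        i=$((i + 1))
    done
}
```

For example, repeat_file 0001.dat 100 repeated.dat would write 100 copies of 0001.dat to repeated.dat (both the count and the output name are placeholders).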
Commands used for testing:

A typical command used for testing and benchmarks is:

time vowpalwabbit/vw ~/i2.dat --num_parse_threads=100 --passes=100 -c -k

Here I list the purpose of some important flags which you should use for your own tests:

Additionally, the timer_0001.sh file contains snippets to create benchmarks on the 0001 dataset variants for text, JSON and cache formats, including multiple passes.
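To compare several thread counts in one go, a sweep along the following lines may help. This is a minimal sketch and not the contents of timer_0001.sh: the function name sweep_threads and the VW_BIN variable are hypothetical, the flags are the ones from the typical command above, and timing uses whole seconds from date, so it is only meaningful for longer runs.

```shell
# Hypothetical, overridable path to the vw binary built from the parallel parsing branch.
VW_BIN="${VW_BIN:-vowpalwabbit/vw}"

# Run vw once per thread count and print "<threads> <elapsed seconds>".
# Usage: sweep_threads <data file> <thread count>...
sweep_threads() {
    data="$1"
    shift
    for threads in "$@"; do
        start=$(date +%s)
        # Same flags as the typical benchmark command; vw's own output is discarded.
        "$VW_BIN" "$data" --num_parse_threads="$threads" --passes=100 -c -k \
            >/dev/null 2>&1 || return 1
        end=$(date +%s)
        echo "$threads $((end - start))"
    done
}
```

A call such as sweep_threads ~/i2.dat 1 2 4 8 prints one line per thread count, which makes it easy to spot where adding threads stops helping.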

Using callgrind:

Valgrind, the popular profiling suite, includes a tool called Callgrind, which records the call graph and the cost of each part of the code (instruction counts by default); KCachegrind then visualizes the result. Try it out using:

valgrind --tool=callgrind vowpalwabbit/vw <data file>
kcachegrind <callgrind output file>

Note: For better source code tracking, use a debug build.
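By default Callgrind writes its profile to a file named callgrind.out.<pid>, so the name changes on every run. A small wrapper like the one below, which relies on Callgrind's --callgrind-out-file option, fixes the output name up front. The function name callgrind_profile and the default file name are assumptions made for this sketch.

```shell
# Hypothetical wrapper: profile vw on a data file and write the profile to a known file.
# Usage: callgrind_profile <data file> [output file]
callgrind_profile() {
    data="$1"
    out="${2:-callgrind.vw.out}"      # assumed default name, pick any you like
    valgrind --tool=callgrind --callgrind-out-file="$out" vowpalwabbit/vw "$data"
}
```

After callgrind_profile <data file> prof.out finishes, the result can be opened with kcachegrind prof.out. For the debug build mentioned in the note, the standard CMake setting is cmake ../ -DCMAKE_BUILD_TYPE=Debug.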