Recap
After covering statistical models like Geometric Brownian Motion and Ornstein-Uhlenbeck for creating realistic synthetic price series for backtesting, we're almost done with the data part of our research lab.
If you haven't already, you can check out that article here.
Using randomly generated synthetic data only gets us so far. At some point, we'll need to consult real market data to at least confirm that our models work as intended and to tackle things like portfolio optimization. So this week we're going to set up a comprehensive database of free real-world market data for 10k+ crypto tokens.
Table of Contents
Recap
Disclaimer
Infrastructure Powered by Docker
Mixing via docker-compose
Isolation & Modularity
Infrastructure Quick Summary
A Word About Survivorship-Bias
Installing Docker & docker-compose
Cloning & Running The Datalab
Next Issue
Disclaimer
If you had a DevOps department they would probably scream at you for this setup. And they would be right to do so! It's horrible. But that's fine! At this point in time, your setup can be horrible. The initial EOD data fetch is slow as hell (~24h) and since you end up with about 7+ million rows of data, operations on the database are also going to be slow(ish). You don't really want to wait minutes for data retrieval when running simulations. Things like (re)indexing the database or refactoring to a more efficient schema will help speed things up later.
Right now though it's just a very basic prototype to build some kind of datahub for our lab so we can continue developing the trading strategy. We can always come back later and change things.
In fact, we're always going to do it like that:
- hack together some things until we think they work like we want them to
- test them (even automate the testing)
- fix things that weren't working as intended
- refactor the infrastructure if it's too slow
- implement the next thing
- repeat
Later on we will transform this into a real service that keeps running around the clock, continuously fetching prices across different frequencies for all kinds of financial instruments. We might even extend it to scrape alternative data too. But for now there's no need to make it any more complicated than it already is. We're not shipping this as a product, it won't ever hit the real world in this form and we're also not going to rely on it 100% to make our decisions.
This bite-sized approach has proven to be the most effective one when working with people without a strong background in IT. There's nothing wrong with making it scalable if you already know how to do so, but we're going to take the long route because otherwise it might be too much to digest for everybody.
Infrastructure Powered by Docker
To build and run our lab we're going to use Docker and docker-compose.
Docker is an open-source tool that automates and simplifies the deployment, scaling and management of applications by using so-called containers. These containers package your application and all its dependencies together so they can be set up and run on any machine, no matter the operating system. You can think of containers like virtual machines, except that they are much more lightweight. While virtual machines usually contain full-blown operating systems to make your code run, Docker containers only include what's actually needed. Each container is isolated and portable. They can run on anything that has Docker installed, including various cloud platforms.
Dependencies needed to run your application are handled in a Dockerfile, which automatically pulls them in during the build stage. The following Dockerfile installs all OS and R specific dependencies needed to run the data scraper for our lab before copying over the source code and then executing it. It will be read and executed every time we (re)build our lab.
# Use the official R image
FROM r-base:latest
# Install dependencies
RUN apt-get update && apt-get install -y \
libcurl4-openssl-dev \
libfontconfig1-dev \
libharfbuzz-dev \
libfribidi-dev \
libfreetype6-dev \
libpng-dev \
libtiff5-dev \
libjpeg-dev \
libpq-dev \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
# Install packages
RUN R -e "install.packages('crypto2')"
RUN R -e "install.packages('RPostgres')"
RUN R -e "install.packages('dplyr')"
# Copy the R script into the container
COPY fetch_data.r /usr/local/bin/fetch_data.r
# Set the working directory
WORKDIR /usr/local/bin
# Run the script
CMD ["Rscript", "fetch_data.r"]
Docker comes with its own CLI and its set of commands is straightforward. To build and run containers from a Dockerfile you simply type
docker build -t myapp:tagname .
where . is the directory containing your Dockerfile and source code, and myapp:tagname is a name and tag you choose, usually based on versioning. This builds the image first. After that you can spin up the application with
docker run -d --name myapp-container -p 8080:80 myapp:tagname
to serve it over port 8080 on your local machine, which forwards all traffic to port 80 inside the container.
Although there are many more useful commands and flags/switches for the docker CLI, we're not going to talk about them today. The above Dockerfile really only defines one service: the data scraper, which handles fetching historical data for crypto tokens. We don't persist this data anywhere yet. For this we're going to set up a PostgreSQL database.
Don't worry! It's going to be so easy, you wouldn't even believe it. Especially if you've been doing it manually on your machines up until now.
Mixing via docker-compose
Setting up multiple services through Docker comes with the drawback that you have to build, run and manage all of them separately via the CLI... or does it?! docker-compose provides a tidy and slick interface to define and manage multiple services together inside a single docker-compose.yml file.
The following configuration defines two services: the database (db) and our data-scraper, each with its own isolated dependencies and underlying software stack. It uses the Dockerfile above to build the data scraper we just configured, an R application that makes outgoing HTTP requests to the free coinmarketcap.com API, while db is just a PostgreSQL database. The .yml file also defines and mounts a volume - think of it as the database's own file drive for now - for the data we want to store. Since we need to connect and authenticate against the database, we define environment variables, which docker-compose reads from an .env file and passes through to the containers when they start.
services:
  db:
    image: postgres:latest
    environment:
      - POSTGRES_USER=${DB_USER}
      - POSTGRES_PASSWORD=${DB_PW}
      - POSTGRES_DB=${DB_DB}
    ports:
      - "${DB_PORT}:${DB_PORT}"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  data_scraper:
    build:
      context: ./data_scraper
      dockerfile: Dockerfile
    environment:
      - DB_USER=${DB_USER}
      - DB_PW=${DB_PW}
      - DB_DB=${DB_DB}
      - DB_PORT=${DB_PORT}
    depends_on:
      - db

volumes:
  postgres_data:
This is the .env file. docker-compose reads its contents and makes the values available inside both containers as environment variables, which we can access via things like Sys.getenv():
DB_USER=postgres
DB_PW=password
DB_DB=postgres
DB_PORT=5432
DISCLAIMER: This is just an example! Always use secure credentials for your environment variables (the defaults shown here are not secure). Never store sensitive information such as passwords or API keys in your source code, and make sure to add your .env files to your .gitignore to prevent them from being tracked by version control.
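Just to make the plumbing concrete, here's a minimal sketch of how the scraper could pick these variables up and open a database connection. The actual fetch_data.r in the repo may look different; db is the service name from docker-compose.yml and resolves automatically inside the Docker network:

# Minimal sketch: read credentials from the environment and connect to Postgres.
library(DBI)
library(RPostgres)

con <- dbConnect(
  RPostgres::Postgres(),
  host = "db",                              # compose service name, not localhost
  port = as.integer(Sys.getenv("DB_PORT")),
  dbname = Sys.getenv("DB_DB"),
  user = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PW")
)

dbListTables(con)  # quick sanity check
dbDisconnect(con)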
Instead of building the images for each service and then running them independently, we can simply type
docker compose up --build
into our terminal. It'll handle everything and notify us when our application is ready. Visual Studio Code has an integrated terminal you can control while also viewing your source code.
To stop the application we can hit CTRL + C. If we want to rebuild our images due to changes in our code, we can clean up any residuals by using
docker compose down --remove-orphans
and then restart it with
docker compose up --build
Remember, we basically just copied the source files into the containers at build time, so a little bit of juggling is needed to replicate the same experience you get when coding and testing locally.
We could also just mount our source code into the container to achieve so-called hot-module-reloading (HMR), which semi-automatically detects changes and restarts our application if needed to reflect them. But since we're not really actively developing today, we're not going to do that. Don't hesitate to contact us if you need more info about HMR!
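If you ever want to try that, a rough sketch would be to add a bind mount to the data_scraper service in docker-compose.yml so changes to fetch_data.r on your machine show up inside the container without rebuilding the image - treat this as an illustration rather than something the repo ships with:

  data_scraper:
    build:
      context: ./data_scraper
      dockerfile: Dockerfile
    volumes:
      # bind-mount the local script over the copy baked into the image
      - ./data_scraper/fetch_data.r:/usr/local/bin/fetch_data.r
    environment:
      - DB_USER=${DB_USER}
      - DB_PW=${DB_PW}
      - DB_DB=${DB_DB}
      - DB_PORT=${DB_PORT}
    depends_on:
      - db

You'd still need to restart the container (or add a file watcher) for R to pick up the change, which is the "semi" part of semi-automatic.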
Isolation & Modularity
Our Docker stack nicely encapsulates and isolates concerns and behaviour by creating containers that only contain what they need to run the underlying software. Another benefit of using Docker is the ease of mixing and matching different types of technologies to suit your needs.
There's no need for the PostgreSQL instance to know anything about how to run R code and vice versa. The data-scraper service doesn't need to know which database is installed. It only needs to know how to connect and write to a generic SQL datastore interface.
If we wanted to switch out our R data-scraper for a Python application, we wouldn't need to keep track of the libraries needed to run R code anymore. For security reasons it's always advised to only have libraries installed that you actively use, so you don't fall victim to unknown 0-day exploits in dependencies you don't even need. Uninstalling them manually would be a rather boring and painful task.
With our current setup we can just specify a new service, base it on python:latest and swap it in for our R image. The same is true for the database. If we ever wanted to switch, we could just use another driver and specify the same credentials to set it up, while the rest is handled by Docker.
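For illustration, such a swap could look roughly like this in docker-compose.yml - the directory and fetch_data.py are made up for this example, and in practice you'd build a proper image with the required Python packages instead of using python:latest directly:

  data_scraper:
    image: python:latest            # placeholder; a custom image with dependencies is better
    working_dir: /app
    volumes:
      - ./data_scraper_py:/app      # hypothetical Python version of the scraper
    command: ["python", "fetch_data.py"]
    environment:
      - DB_USER=${DB_USER}
      - DB_PW=${DB_PW}
      - DB_DB=${DB_DB}
      - DB_PORT=${DB_PORT}
    depends_on:
      - db

The db service and the volume stay exactly as they are.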
In the future we're going to add some kind of webserver - probably Python/Flask or a quick and dirty custom-made net/http Golang application - to make retrieving and writing data from the DB even easier. It'll provide a nice interface for us to make HTTP requests against to fetch price data by hitting http://localhost from our local research scripts, which can be anything we want: Python, Ruby, C++, R, Excel, it doesn't matter! By decoupling data retrieval from a database-specific interface to a more generic one, we give ourselves a lot of room and flexibility.
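Just to make that idea concrete, fetching prices from such a service might eventually look something like this from any terminal or script (the endpoint, port and parameters are purely hypothetical, nothing like this exists in the repo yet):

curl "http://localhost:8000/prices?symbol=BTC&start=2021-01-01"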
Infrastructure Quick Summary
That may have been a lot to absorb, so let's summarize: we're setting up a dockerized environment with, currently, two services. One of the services - the data-scraper - is responsible for sourcing crypto market data and uses the R/crypto2 library to do so. The other service encapsulates the inner workings of a PostgreSQL database. Each service knows basically nothing about the underlying technology of its counterpart - apart from speaking SQL, that is. We can take it a step further and layer a generic HTTP interface for simple database CRUD actions on top of the db.
We also mount a portable Docker volume into the db service so we have something to store our data in. By passing in credentials as environment variables, we enable the containers to communicate with each other in a meaningful way. Docker itself creates its own network to handle communication between the containers, but we can also access them via localhost if we want to.
This modular approach of different containers for different concerns allows for easy changes and scalability by isolating the services and their dependencies. Switching technology or extending the stack involves minimal changes, maintaining flexibility and security while also facilitating iterative development and experimentation without excessive complexity. The full stack can be started and stopped with a single docker compose command.
The code and files needed - including the R data scraper - can be found in this week's GitHub repository. If you still have any questions, contact us and we're happy to help!
We're now going to deploy this thing!
A Word About Survivorship-Bias
The data we're pulling in is only somewhat free of survivorship bias. Upon inspecting it, we found multiple tokens whose price series flatline at 0 from a certain point on and verified their delistings, but for some tokens there wasn't any data available. Later down the line we'll need properly survivorship-bias-free data, but there's no need for it right now. Data providers like Tardis are quite expensive and currently we want to keep focusing on developing our strategy, which might take some time. For now we have enough real-world dummy and synthetic data to work with. When it's time, we're going to utilize the same scraping technique, just for higher quality data, and simply swap out the scraper service.
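Once the data is in, you can get a feel for this yourself with a rough sketch like the following. It assumes a table called prices with symbol, date and close columns and reuses the DBI connection con from earlier, so adjust the names to whatever the scraper actually writes:

# Rough sketch: flag tokens whose most recent closes are all zero,
# a hint that they were delisted rather than surviving until today.
library(DBI)
library(dplyr)

prices <- dbGetQuery(con, "SELECT symbol, date, close FROM prices")

suspect_delistings <- prices %>%
  group_by(symbol) %>%
  arrange(date, .by_group = TRUE) %>%
  filter(n() >= 30) %>%                             # ignore very short histories
  summarise(zero_tail = all(tail(close, 30) == 0)) %>%
  filter(zero_tail)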
Installing Docker & docker-compose
If we want to use Docker and docker-compose we need to install them first. The installation process is a little bit different for each operating system, but in every case pretty straightforward.
This tutorial focuses on setting everything up on Ubuntu. A manual for each OS can be found here. The Docker Engine is available on Linux through docker-ce, and on Windows and Mac through Docker Desktop.
If you haven't already, it might be a good idea to install VSCode to follow along. It's our editor of choice and it'll make it much easier to replicate all the steps. You can use whatever editor or IDE you want though.
DANGER ZONE: If you already have a working Docker environment and projects in it, you don't need to do any of this. The following actions are not reversible and destroy your existing environment. Only follow along if you intend to set up Docker from scratch.
We're going to start by deleting unofficial packages that might have been shipped with your OS by running:
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
inside our terminal. Old images and containers aren't automatically removed when uninstalling Docker, so we clean them up manually with
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
sudo rm /etc/apt/sources.list.d/docker.list
sudo rm /etc/apt/keyrings/docker.asc
SAFE ZONE AGAIN
Next we need to set up the Docker apt repository to be able to install and update Docker from it:
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
Now we can install the latest Docker version and dependencies:
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
After that we need to add our current user to the docker group
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
and verify that everything is working by running
sudo docker run hello-world
which pulls in a hello-world application and then runs it inside your terminal.
Cloning & Running The Datalab
We can now download the above setup by cloning this week's GitHub repo:
git clone https://github.com/KatanaQuant/newsletter.git
and then open it in VSCode. You can either open the full repository and change the working directory manually to Issue 9, or just open the Issue 9 directory directly.
The only thing left to do now is to create an .env file containing the PostgreSQL credentials
DB_USER=postgres
DB_PW=password
DB_DB=postgres
DB_PORT=5432
and then run docker compose up --build, which will build the application and start scraping the historical data for you.
If you want to inspect the data after it has been written to the database, you can either use GUI tools like DBeaver or just connect via the terminal over localhost. To be able to connect to the database, it has to be started via Docker first. To avoid refetching insane amounts of data after the initial fetch is done, you can simply comment out the scraper service in the docker-compose.yml file for now.
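Abbreviated, that looks something like this (only the db service stays active; the scraper lines are the same ones shown earlier, just commented out):

services:
  db:
    image: postgres:latest
    environment:
      - POSTGRES_USER=${DB_USER}
      - POSTGRES_PASSWORD=${DB_PW}
      - POSTGRES_DB=${DB_DB}
    ports:
      - "${DB_PORT}:${DB_PORT}"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  # data_scraper:
  #   build:
  #     context: ./data_scraper
  #     dockerfile: Dockerfile
  #   environment:
  #     - DB_USER=${DB_USER}
  #     - DB_PW=${DB_PW}
  #     - DB_DB=${DB_DB}
  #     - DB_PORT=${DB_PORT}
  #   depends_on:
  #     - db

volumes:
  postgres_data:

A terminal connection is then, for example, psql -h localhost -p 5432 -U postgres (psql has to be installed locally and will prompt for the password from your .env file).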
Next Issue
We're going to layer a web interface on top of the database to make reading from it easier. If we need to, we will also refactor the database schema and show you how to safely migrate the changes to make it more efficient and speed up transactions. We encourage you to experiment and try adding a data-reading service yourself using the above docker-compose approach. If you can't quite get it to work via Docker, you can also try to access the db from local Python scripts connecting to localhost, just like with DBeaver, and work your way up from there.
In any case, happy coding!
- Hōrōshi バガボンド
Newsletter
Once a week I share Systematic Trading concepts that have worked for me and new ideas I'm exploring.