In this article, I will present my solution for the Trustly admission test. The challenge consists of building an API with one route that, given a GitHub repository URL, returns the number of lines and bytes in the repository, grouped by file extension. The catch: I was not allowed to use the GitHub API or any web-scraping library.
Besides that, the challenge requires that API users never receive timeout errors or long response times, even on subsequent requests.
My solution was a TypeScript API, implemented with Express.js and deployed on Microsoft Azure.
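To make the response shape concrete, here is a minimal sketch of the aggregation the route has to produce: line and byte counts grouped by file extension. The names and types here are my own illustration, not the original code.

```typescript
// Stats collected for a single file in the repository.
interface FileStat {
  path: string;
  lines: number;
  bytes: number;
}

// Aggregated totals for one file extension.
interface ExtensionSummary {
  count: number;
  lines: number;
  bytes: number;
}

// Groups per-file stats into per-extension totals, the shape the
// API route ultimately returns as JSON.
function groupByExtension(files: FileStat[]): Record<string, ExtensionSummary> {
  const summary: Record<string, ExtensionSummary> = {};
  for (const file of files) {
    const dot = file.path.lastIndexOf(".");
    // Files like "README" or ".gitignore" have no real extension.
    const ext = dot > 0 ? file.path.slice(dot + 1) : "(no extension)";
    const entry = summary[ext] ?? { count: 0, lines: 0, bytes: 0 };
    entry.count += 1;
    entry.lines += file.lines;
    entry.bytes += file.bytes;
    summary[ext] = entry;
  }
  return summary;
}
```

An Express route handler would then just scrape the repository, build the `FileStat[]` list, and send `groupByExtension(files)` back as JSON.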
To scrape and extract information from the GitHub website, I used an HTML parser that ingests raw HTML and returns the DOM element tree, so I can run query selectors on it.
You can see the GitHub repository here.
The big problem I had to deal with was traversing through directories.
At first, I fetched each file on the current page to get its size, then opened each directory and repeated the process recursively until no more directories were found.
With this approach, however, the requests ran serially: each file page request started only after the previous one had finished. That made my API's response time extremely high, so I needed to change the algorithm.
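The serial pattern looks like this sketch (the fetching functions are hypothetical stand-ins for the real HTML-scraping calls, injected here so the structure is clear):

```typescript
type Entry = { name: string; isDir: boolean };

// First, slow approach: every await blocks the loop, so each file page
// is only requested after the previous one completes.
async function crawlSerial(
  path: string,
  fetchEntries: (p: string) => Promise<Entry[]>,
  fetchFileSize: (p: string) => Promise<number>
): Promise<number> {
  let totalBytes = 0;
  for (const entry of await fetchEntries(path)) {
    if (entry.isDir) {
      // Recurse into subdirectories, still one request at a time.
      totalBytes += await crawlSerial(`${path}/${entry.name}`, fetchEntries, fetchFileSize);
    } else {
      // This request starts only after the previous iteration finished.
      totalBytes += await fetchFileSize(`${path}/${entry.name}`);
    }
  }
  return totalBytes;
}
```

With hundreds of files, each taking a network round trip, the total latency adds up linearly, which is what made the response times so high.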
Thinking it over, I found a new solution: open all the repository folders first, pushing each directory's file list into one big main file list; then, once the directories are done fetching, split the big list into batches of 100 files and run concurrent requests for all the file pages in each batch.
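The batching step can be sketched like this (names are illustrative; `fetchFile` stands in for the real file-page scraping call):

```typescript
// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Fetches every file page, 100 at a time: all requests inside a batch
// run concurrently via Promise.all, while batches run one after another
// so we never have thousands of sockets open at once.
async function fetchAllFiles<T>(
  fileUrls: string[],
  fetchFile: (url: string) => Promise<T>,
  batchSize = 100
): Promise<T[]> {
  const results: T[] = [];
  for (const batch of chunk(fileUrls, batchSize)) {
    results.push(...(await Promise.all(batch.map(fetchFile))));
  }
  return results;
}
```

This turns N sequential round trips into roughly N/100 rounds of concurrent ones, which is what brought the response time down.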
To deploy the API, I first wanted to use AWS Lambda + API Gateway to run it in a serverless context, but I saw that, depending on the repository size, the API response could take more than 30 seconds, which is the timeout limit for API Gateway. So, with Lambda, users would receive timeout errors.
I therefore needed another option for my API deployment. I had heard about Kubernetes clusters, but I had never used them, so I decided to learn Kubernetes deployment for the first time.
First of all, to deploy my API, I had to choose a cloud provider. I searched for options that would let me run free Kubernetes cluster deployments and landed on Microsoft Azure.
I had never used any cloud other than AWS, so I jumped into this challenge as well.
After the Kubernetes deployment, I had an IP address for my API instance. To make it more readable, I registered a new domain and DNS record at Freenom, so now my API endpoint looks much better.
I would like to give thanks for the opportunity to learn new technologies like Kubernetes and to work with new cloud providers. I loved the challenge's purpose, and its limitations made me think carefully about my choices and solutions. If I had more time, I would love to set up GitHub CI/CD for my API and build a beautiful front-end to consume it.