The white rabbit repository has grown too big
For those who don't know me, I administer the ohwr.org site.
Currently it takes ~3 hours to make a full clone of the project's repo (assuming a decent connection; on a slow one it can take ages).
Javier asked me a few days ago to look into it. I have several suggestions, but all of them require changes on your part, so I've decided to open an issue to start a discussion.
My suggestions are, in order of importance:
Split into two repos: "archive" vs "work".
A big chunk of the data downloaded is historical, not current. That history is of interest in certain cases, but not always, and not to everyone. We could create an "archive" repo for history-related work, and a "work" repo with no history, containing just a snapshot of the current repo state. The "work" repo would be significantly smaller and faster to download.
Drawbacks: There are several ways of doing this, each with its own advantages and problems, but all of them would require anyone actively working with the white-rabbit repo (not just doing historical work) to refresh their repo - probably by erasing it and re-downloading it (the new one would be smaller). One possible recipe is sketched below.
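Just to make the idea concrete, here is one possible recipe for the split, done server-side with plain SVN tools. The URL and paths are only illustrative, not the real layout:

    # Export a clean snapshot of the current state (no history, no .svn metadata).
    svn export https://ohwr.org/svn/white-rabbit/trunk wr-snapshot

    # Create the new "work" repository and import the snapshot as its first revision.
    svnadmin create /srv/svn/white-rabbit-work
    svn import wr-snapshot file:///srv/svn/white-rabbit-work/trunk \
        -m "Initial snapshot of the current white-rabbit state"

    # The existing repository is left untouched and simply becomes the "archive".

The old repo would stay around for historical digging, while everyone doing active work would switch to the small one.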
Move to git
Git is faster and smaller than SVN, by all accounts. If my memory serves, Javier told me that he didn't want to go that way, since many of the repo users were comfortable with SVN. If this is still the case, then this point is moot.
I also suspect that converting the whole repo, with its full history, to git would be too heavy for any of our machines; it's just too huge. My recommendation would be to move to git while doing the work/archive split mentioned above (since everyone would have to refresh their repos anyway); a rough sketch follows the drawback below.
Drawbacks: Not an option if someone still needs SVN access.
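If we combine the move to git with the split above, the git side could look roughly like this (names are again illustrative, and the full-history conversion would only be needed if we also want the archive in git):

    # "Work" repo: start a fresh git history from the exported snapshot.
    cd wr-snapshot
    git init
    git add .
    git commit -m "Initial snapshot imported from the SVN repository"

    # "Archive" repo: full-history conversion - this is the heavy, slow part.
    # --stdlayout assumes the usual trunk/branches/tags structure.
    git svn clone --stdlayout https://ohwr.org/svn/white-rabbit white-rabbit-archive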
Cleanup, especially of big binary files
The bulk of the size of the white-rabbit repo is taken up by big binary files (PDFs and the like), some of them tens of MB in size.
Neither SVN nor git is optimized for handling big binary files; they are designed for source code. Source code tends to be small, and successive versions are locally similar, which allows the tools to store "diffs" instead of complete files. That strategy does not always work for binary files: a seemingly trivial change, like adding a comma or removing a space, can alter the stored data so much that a diff is not feasible, and a full copy of the file is stored instead. So we could spend 20 MB when, for example, a font color is changed in a 50 MB PDF.
My recommendation is to remove from the work repo any non-essential binary files. If a binary can be generated from a source file, then the binary should not be stored in the repo (the instructions or scripts for generating it should). I would go as far as to .gitignore all the generated files, and maybe add a script to regenerate all of them from source; rebuilding them would probably take less time than downloading them (with all their history) anyway. A small sketch for finding the biggest offenders is given after this item.
Binary files that don't change very often (manuals, posters, etc.) could be stored in the Documents tab instead of in source control.
Drawbacks: Someone has to go through the repo and list all the non-essential binaries. For those that can be regenerated, scripts/instructions for doing so must be written.
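As a starting point for that audit, something along these lines can spot the biggest offenders in a checkout and keep generated artifacts out of a future work repo. The size threshold and the ignore patterns are just examples; the real list has to come from the audit itself:

    # List files bigger than 5 MB in a working copy, largest first, skipping VCS metadata.
    find . -type f -not -path './.svn/*' -size +5M -exec du -h {} + | sort -rh | head -n 20

    # Seed a .gitignore for generated artifacts (patterns are examples only).
    cat > .gitignore <<'EOF'
    *.pdf
    *.bit
    build/
    EOF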
These are my suggestions for now.
Javier: please feel free to comment or reassign this to anyone you feel would be able to help here.