Migrating from Bitbucket

Hello!

I’m working to migrate a subversion <-> git mirror out of self-hosted Atlassian Bitbucket (where we use the subgit plugin) and onto a standalone subgit installation. For the git repo, we’ll be migrating from Bitbucket to Phabricator; since Phabricator is similar to Github and has a commit database on top of the git repo itself, we’ll be following the Github setup guide:

https://subgit.com/documentation/github.html

Anyway, I have two questions:

(1) The initial subversion -> git import takes a very long time, owing to the size of the repo. In the past when the mirror has broken and svn/git histories diverged, within Bitbucket we’ve had to reimport the git repo from svn, which freezes the git repo for a long time. On the current standalone version (3.3.10), if both svn and git commits continue while mirror operations are down for an extended period, is subgit sophisticated enough to Do The Right Thing™ and re-merge commit histories seamlessly, or does this require a fresh reimport from one of the sources, and we’d lose deltas committed in the interim to the other?

(2) As we’re already a paying customer and will continue to be a paying customer (simply switching from the Atlassian Bitbucket plugin to the standalone daemon), I was wondering if it might be possible to request an extension to the 7-day trial. We haven’t hit the 7-day limit, yet, but I wanted to inquire in advance of that eventuality as we continue to test the migration path.

Thanks so much!

David Goerger
Systems Engineer
Hudson River Trading, NYC

he/him/his

Hello David!

If the mirror between the Git and SVN repositories is down and new commits are made on both sides, then I’m afraid SubGit won’t be able to resume the mirror by merging those commits automatically, because this is a diverged-history situation. When the mirror is re-established after such a period, SubGit moves the new Git commits (those not present in SVN) to the ‘unsynced’ namespace, which resets the Git repository to the last synchronized state, and then translates all new SVN revisions into the Git repository, which restores the synchronization. The new Git commits remain available in the repository, so it is possible to merge them manually after the mirror is restored.
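For the manual merge mentioned above, a minimal sketch might look like the following. It assumes the set-aside commits show up as refs under a refs/unsynced/ prefix; the exact ref names are an assumption, so check what SubGit actually created in your repository. The helper name list_unsynced is hypothetical.

```shell
#!/bin/sh
# List refs that were set aside after a diverged-history recovery.
# ASSUMPTION: the refs live under refs/unsynced/; adjust the prefix to
# whatever `git for-each-ref` actually shows in your repository.
list_unsynced() {
    repo=$1
    prefix=${2:-refs/unsynced/}
    git -C "$repo" for-each-ref \
        --format='%(refname) %(objectname:short)' "$prefix"
}

# Typical follow-up in a working clone (not run here):
#   git fetch origin 'refs/unsynced/*:refs/remotes/unsynced/*'
#   git log --oneline master..<unsynced ref>   # inspect what was set aside
#   git cherry-pick <commit>                   # or: git merge <unsynced ref>
#   git push origin master                     # the push is translated to SVN
```

Running list_unsynced against the mirrored repository after the mirror is restored shows which commits still need to be merged by hand.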

Sure, it’s perfectly possible to extend the trial period to 30 days; you would just need to install a 30-day trial key. It can be obtained on the following page:

https://subgit.com/pricing

Select the “Trial” license kind. After you submit the form on the website, you will receive an email with the key. To extend the trial period, the mirrored repository must be registered with that key. To register the repository, run the following command:

subgit register --key <path to key file> <path to mirrored repository>

This command requires root privileges, as it writes to system directories.

Dear TMate,

I’m having some trouble setting up a bidirectional svn<->git sync. I’ve been able to create what appears to be a one-way sync from svn->git, but if I start over and follow the quickstart guide, I get the following error message:

$ /usr/scratch/subgit/bin/subgit configure --layout auto --svn-url svn+ssh://shasubgit1.hudson-trading.com/usr/scratch/svn /usr/scratch/config4.git
SubGit version 3.3.10 (‘Bobique’) build #4368

Configuring writable Git mirror of remote Subversion repository:
Subversion repository URL : svn+ssh://shasubgit1.hudson-trading.com/usr/scratch/svn
Git repository location : /usr/scratch/config4.git

Detecting peg location…
Authentication realm: svn+ssh://shasubgit1.hudson-trading.com
Username: dgoerger
Password for ‘shasubgit1.hudson-trading.com’ (leave blank if you are going to use private key):
Private key for ‘shasubgit1.hudson-trading.com’ (OpenSSH format): /home/dgoerger/.ssh/id_ed25519_subgit_test_20210120
Private key passphrase [none]:
Port number for ‘shasubgit1.hudson-trading.com’ [22]:
Authentication realm: svn+ssh://shasubgit1.hudson-trading.com
Username: dgoerger
Password for ‘shasubgit1.hudson-trading.com’ (leave blank if you are going to use private key):
Private key for ‘shasubgit1.hudson-trading.com’ (OpenSSH format): /home/dgoerger/.ssh/id_ed25519_subgit_test_20210120
Private key passphrase [none]:
Port number for ‘shasubgit1.hudson-trading.com’ [22]:
Authentication realm: svn+ssh://shasubgit1.hudson-trading.com
Username: dgoerger
Password for ‘shasubgit1.hudson-trading.com’ (leave blank if you are going to use private key):
Private key for ‘shasubgit1.hudson-trading.com’ (OpenSSH format): /home/dgoerger/.ssh/id_ed25519_subgit_test_20210120
Private key passphrase [none]:
Port number for ‘shasubgit1.hudson-trading.com’ [22]:

CONFIGURATION FAILED

error: svn: E200015: Authentication cancelled
error: Unexpected error has occurred; please report along with the logs (’/home/dgoerger/subgit-configure-20210120-141649.zip’)
error: to http://issues.tmatesoft.com/, thank you!

The key I created for testing does not have a passphrase—although I’ve tried it both ways, with and without a passphrase, in case it made a difference (it did not).

I’m attaching the uncompressed log file. Can you advise how to proceed? I’m confused why it appears to say authentication cancelled when no ssh login attempt is recorded in /var/log.

If it matters, our SVN repo is only accessible via ssh and the file protocol (i.e. no http/https). I was able to create a mirror svn->git by configuring only the path to the raw SVN repo, which created e.g. /usr/scratch/svn/conf/subgit.conf, but did not create an analogous /usr/scratch/config.git/subgit/config directory tree. If we have to set up the sync “by hand” with svn+git post-commit hooks, that’s fine, but I’m concerned that I’m not able to follow the official instructions as written.

Thanks so much!

David Goerger

he/him/his

subgit-configure.0.log (187 KB)

Hello David,

thank you for reporting that issue!

It looks like we have an issue with ssh; we will investigate it. You mentioned, however, that the SVN repository is on the same server and available over the file protocol, is that correct? If so, then I think it would be even better to use the file protocol instead of svn+ssh, like this:

subgit configure --layout auto --svn-url file:///usr/scratch/svn /usr/scratch/config4.git

This would allow you to establish a mirror.

Hello David,

In addition to my previous message: it looks as if the issue is caused by an incorrect key. SubGit restarts the authentication process if it discovers that the provided username and key pair is incorrect for the given SVN URL. In that case it asks for the username and key path two more times and then fails with “E200015: Authentication cancelled”, which is exactly how it behaved here. So I’d like to ask whether this is the key that was added to authorized_keys for user dgoerger. If so, have you tried using it with the native svn client, and did it work?
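One way to answer that question is to try the key with the native svn client outside of SubGit. The sketch below relies on the SVN_SSH environment variable, which Subversion uses to pick the tunnel command for svn+ssh URLs. The helper name check_svn_key and the SVN_BIN override are hypothetical, and the key path must not contain spaces.

```shell
#!/bin/sh
# Try an svn+ssh URL with one specific private key, bypassing ssh-agent
# and any other identities, so that a wrong key fails visibly.
# SVN_BIN is a hypothetical override that allows a dry run.
check_svn_key() {
    key=$1
    url=$2
    SVN_SSH="ssh -i $key -o IdentitiesOnly=yes -o BatchMode=yes" \
        "${SVN_BIN:-svn}" info --non-interactive "$url"
}

# Example with the values from the transcript above:
#   check_svn_key /home/dgoerger/.ssh/id_ed25519_subgit_test_20210120 \
#       svn+ssh://dgoerger@shasubgit1.hudson-trading.com/usr/scratch/svn
```

If this call fails as well, the problem lies with the key or the sshd configuration rather than with SubGit.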

Thanks Ildar! The file:// URI works beautifully!

The ssh pubkey is in ~/.ssh/authorized_keys for the target user, so I’m not sure why that doesn’t work (indeed it doesn’t show up in the sshd logs as even trying to connect), but in any case operating on the raw file path will be faster than looping through the network stack, so I think we’re ok there.

I did have another question though. The repository we wish to mirror has more than a thousand tags, and more than a quarter million commits. We don’t actually care to mirror the old tags or even the old commits per se—starting the git mirror from HEAD and ignoring past commits would be sufficient for our purposes. If we didn’t need the mirror operation (some commits occur via svn, some via git), the simple solution would be to “cd” into an svn checkout, type “git init . && git add . && git commit -m ‘initial commit’ && git push” and be done with it. Alas, we do need the mirror.

In an effort to speed up the initial mirror creation process, I used layout std instead of auto, which ignores the tags and only picks up two branches, and passed --minimal-revision=$LATEST_REVISION. With that, “subgit configure” takes a few seconds to complete. Then I can edit repo.git/subgit/config, drop the branch I don’t care about, copy in authors.txt, and start the “subgit install” process.

By setting the minimal revision, I expected the translation would only need to mirror HEAD, but indeed the install process to create the initial mirror still took days (plural), and bpftrace(8) revealed that the subgit translation proc appears to be running the open(2) syscall against every historical commit hash in the tree. Obviously running open() more than a quarter million times will take a fair amount of time, and I’m not sure why it’s necessary. top(1) suggests that the operation is NOT i/o bound, for what it’s worth, but if we could skip historical commits altogether, that would seem ideal (unless I’m overlooking something).

In order to safely cut over from our Bitbucket + SubGit plugin setup to a standalone copy of SubGit with as brief of downtime as possible, do you have any recommendations for how to speed this up? We’re already running on local disk -> local disk (no nfs/cifs), raw filesystem access (no network drivers), and again we only need to translate HEAD.

Thanks so much!

David

Hello David,

glad to know that you got it working!

SubGit actually works the way you assumed: given minimalRevision, it does not try to fetch the whole repository history; it starts at the minimal revision and then imports the subsequent revisions. It may reach back to some older revisions if excludeBranches is set, but not to all of them, so it is rather strange that you see it accessing so much historical data. Do you see the same in the SubGit logs? Would it be possible to share the logs with us for analysis?

Dear Ildar,

Apologies for the delay, I had to re-run the migration to fetch new logs due to unrelated nfs filer maintenance which prevented the previous sync I’d mentioned from ever finishing in the first place. (I’d had to ^C it before the NAS was taken offline for an upgrade.)

I actually re-ran the sync twice, once each on distinct hardware, to confirm results. I’m attaching the logs from the second run (faster physical hardware, local ssd operation only), initiated Thursday, which took 32 hours, 8 minutes, 56 seconds to complete. While both syncs were running, I was not able to identify any obvious resource bottlenecks from top, iotop, RAM usage, nor was the system swapping at any point. The subgit mirror operation was indeed the only process running on the hosts, besides the usual services from a vanilla Debian install.

Steps taken:

  1. /usr/scratch/subgit/bin/subgit configure --minimal-revision 342183 file:///usr/scratch/svn /usr/scratch/config1.git
  2. vi /usr/scratch/config1.git/subgit/config
  3. cp -p ~/scratch/svn-authors.txt /usr/scratch/config1.git/subgit/authors.txt
  4. time /usr/scratch/subgit/bin/subgit install /usr/scratch/config1.git

I’m attaching /usr/scratch/config1.git/subgit/config (subgit.config) and /usr/scratch/config1.git/subgit/logs/daemon.log (xz-compressed). There are also the autogenerated zip bundles, but I’m unsure how to transmit such a large log file.

The subgit-install-0.log file, which is 12 gigabytes uncompressed, includes many lines of

“text” and “eol” are set to “UNSET” and “UNDEF” because svn:eol-style was changed to null or the file was added; and svn:mime-type was changed to null or the file was added

Thanks so much in advance for any pointers you may have for us!

David

subgit_daemon_shasyslab3.log.xz (533 KB)

subgit.config (10.1 KB)

Dear David,

thanks for your response and the logs.

However, I was hoping to get information about the initial import process, to check whether the import touches old revisions. I’m afraid the daemon log is not much help for that purpose, as it only contains information about the running mirror. In fact, this particular daemon log only contains periodic SVN checks, without any translation activity, so we would need the ‘install’ log to check the initial import itself.
Judging from the configuration, EOL translation is enabled (which is the default), and that’s why those lines appear in the logs. EOL translation may indeed take a long time, and it may be worth switching it off as described in this article:

https://subgit.com/documentation/faq.html

There are also a couple more settings that may be useful to speed up the import.

Dear Ildar,

Thanks! I’ll try disabling EOL translation. In order to submit the install log, do you have an upload site where I can submit a 330 megabyte file? (12 gigabyte uncompressed -> 330M compressed with xz -9, or 642M using regular zip).

Thanks again,

David

Dear Ildar,

I spoke with my supervisor and we were able to create a public upload on google drive:

https://drive.google.com/file/d/1PyiZu7t_evTaTfwJLtDizKPs_HEeUHT8/view?usp=sharing

Please let me know when you’ve been able to download a copy and we will delete it from the public file share.

Thanks so much!

David

Hello David!

Thank you for the log.

I haven’t found any signs that SubGit tries to access older revisions, however. The minimal revision is set to 342183:

minimalRevision = 342183

and SubGit actually starts the translation starting from that very revision:

[2021-02-04 16:17:20.381][subgit-install][1] SET_PATH '' 342183 not empty depth=infinity

[2021-02-04 16:17:20.408][subgit-install][1] fetching: branch = refs/svn/root/trunk, revision = 342183, receivedFileCount=0

I’d like to note, however, that the minimal revision 342183 is relatively far from the latest one, revision 344112, so SubGit actually translates about 2000 revisions, not a single one.
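The figure of roughly 2000 revisions follows directly from the two revision numbers quoted above. A small sketch of the arithmetic is shown below; the function name revisions_to_translate is hypothetical, and the result is an upper bound, since not every revision necessarily touches a mapped branch.

```shell
#!/bin/sh
# Number of revisions in the range the import has to walk when it starts at
# minimal revision $1 and the repository is currently at revision $2
# (both endpoints included).
revisions_to_translate() {
    echo $(( $2 - $1 + 1 ))
}

revisions_to_translate 342183 344112   # the run discussed above; prints 1930

# To start as close to HEAD as possible, the youngest revision of a local
# repository can be read just before running `subgit configure`:
#   svnlook youngest /usr/scratch/svn
```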

Also, you once mentioned that tags were excluded to speed up the translation, but I found that wasn’t the case in this attempt:

 trunk = trunk:refs/heads/master
 branches = branches/*:refs/heads/*
 tags = tags/*:refs/tags/*
 shelves = shelves/*:refs/shelves/*

This configuration captures all branches and all tags, so if they are not needed, I think it may be worth excluding them or limiting the number of imported tags and branches, or starting from a later revision.
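As a sketch of how the mapping above could be trimmed before running subgit install: the SubGit config file uses git-config syntax, so git config -f can edit it in place. This assumes the mappings live in the [svn] section, as in a repository created with subgit configure --svn-url; the helper name drop_branches_and_tags is hypothetical. The shelves mapping is deliberately left alone here.

```shell
#!/bin/sh
# Remove the branches and tags mappings from a SubGit config so that only
# trunk (and shelves) remain mapped.
# ASSUMPTION: the mappings are stored in the [svn] section of the file.
drop_branches_and_tags() {
    cfg=$1
    for key in branches tags; do
        # --unset-all exits non-zero when the key is absent; that is fine here.
        git config -f "$cfg" --unset-all "svn.$key" || true
    done
    # Show what is left in the [svn] section for review.
    git config -f "$cfg" --get-regexp '^svn\.'
}

# Example (run after `subgit configure`, before `subgit install`):
#   drop_branches_and_tags /usr/scratch/config1.git/subgit/config
```

Editing the file by hand in an editor achieves the same result; the function only makes the step repeatable across test runs.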

And I agree that there are many EOL-related lines in the logs, which may also introduce a noticeable delay in the translation, so it may be worth switching EOL translation off to speed it up.

Dear Ildar,

Another question if I may:

I’ve configured subgit’s “core.shared=true” option and re-run subgit install, as well as set chmod g+swX on the git repo itself to ensure group write permissions, but I’m still getting this error when the subgit daemon is (1) started as User1 and then (2) User2 (in the same primary unix group “users”) tries to push to the repo:

$ git push
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 326 bytes | 326.00 KiB/s, done.
Total 3 (delta 2), reused 1 (delta 1)
remote: UNRECOVERABLE ERROR:
remote: Missing unknown f79d4c92b9c55299a9c03a02444e8ca94eaf577f
remote:
remote: CURRENT STATE:
remote: Git object f79d4c92b9c55299a9c03a02444e8ca94eaf577f can’t be accessed in the Git repository /usr/scratch/config.20210330.git.
remote:
remote: POSSIBLE REASONS:
remote: 1) Git server is out of memory, available disk space,
remote: processes count, or open files limits;
remote: 2) When pushing through SSH, the system user on behalf of which.
remote: SubGit daemon is running is not allowed to read files created by SSH user.
remote: TO RECOVER:
remote: 1) Make sure Git server has enough memory, disk resources AND high enough
remote: open handler and processes count limits.
remote: 2) When pushing through SSH, either
remote: A) make sure the SSH user is the same user on behalf of which the SubGit is running
remote: AND
remote: is the same as the system user which is the owner of the Git repository; OR
remote: B) set ‘core.shared’ option to ‘true’ in SubGit configuration file, run
remote: $ subgit install /usr/scratch/config.20210330.git
remote: to apply changes and make sure the owner of the Git repository and SSH user
remote: belong to the same system group.
remote: For details regarding sharing Git repository for a system group see:
remote: https://git-scm.com/docs/git-config#Documentation/git-config.txt-coresharedRepository
remote:
remote: Once you have (1) and (2) fixed, retry push.
remote:
remote: TO REPORT:
remote: 1) Get error log from the server at
remote: ‘/usr/scratch/config.20210330.git/subgit/logs’
remote:
remote: 2) Report an issue at http://issues.tmatesoft.com/
remote:
remote: THANK YOU!
To ssh://shasubgit1.hudson-trading.com/usr/scratch/config.20210330.git
! [remote rejected] master -> master (pre-receive hook declined)
error: failed to push some refs to ‘ssh://shasubgit1.hudson-trading.com/usr/scratch/config.20210330.git’
$

The logfile doesn’t contain any more information than what’s in this git trace.

I can work around this by configuring the git post-commit hook to close the subgit daemon after commit (if I do that, I have confirmed this is sufficient to allow both users to commit to the shared repo, ruling out a problem with the git repo or files itself), but then we don’t have anything polling the subversion repo to ensure commits to subversion are automatically translated into git.

I can think of a few possible workarounds, but am not sure if there’s something obvious necessary to get “core.shared=true” working correctly. I’ve re-read the documentation for both subgit and git, and I’m pretty sure it’s configured correctly at least per the documentation. And as I said, the git repo is writable by all members of the group if and only if the subgit daemon isn’t running at the time of the commit—which is why I think there’s something wrong with the subgit setting. Attached is the subgit/config file.

Even with this working as documented, however, I’m concerned that commits to the subversion repo won’t be automatically translated until a commit is made to the git repo. Is there a way to run this as a proper system daemon under systemd, or add an analogous post-commit autotranslation hook to the svn repo?

Thanks!
David

subgit.config.20210330 (10.2 KB)

Hello David,

This approach was indeed applicable and worked some years ago, but the Git team introduced the quarantine feature in 2.11, and this feature makes SubGit fail in such a setup. The quarantined objects are created with a particular set of permissions that leads to issues like the one you faced. Nonetheless, an ‘ssh’ setup is still possible, although I would not recommend creating a separate user for every developer. Instead, create a single user, such as ‘git’, and use that user for the ssh communication, as described in this article:

Git - Setting Up the Server

This setup works perfectly even with core.shared=false and is not affected by the quarantine, so it looks like the setup you need.
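A minimal sketch of the single-account setup described above, following the general approach of the referenced article. It assumes the shared account is named git and was created beforehand (for example with sudo adduser git); the helper name add_developer_key and the example key file name are hypothetical.

```shell
#!/bin/sh
# Append a developer's public key to the shared account's authorized_keys,
# with the permissions sshd expects (700 on .ssh, 600 on the file).
# ASSUMPTION: this runs as the shared `git` user, and $1 is its home directory.
add_developer_key() {
    home=$1
    pubkey=$2
    mkdir -p "$home/.ssh"
    chmod 700 "$home/.ssh"
    cat "$pubkey" >> "$home/.ssh/authorized_keys"
    chmod 600 "$home/.ssh/authorized_keys"
}

# Example, run as the git user:
#   add_developer_key "$HOME" /tmp/id_ed25519.alice.pub
# Developers then push through that one account, and the SubGit daemon is
# started by the same account:
#   git remote set-url origin \
#       git@shasubgit1.hudson-trading.com:/usr/scratch/config.20210330.git
```

Because every push then arrives as the same system user that owns the repository and runs the daemon, the permissions of quarantined objects no longer matter.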