Branch mapping with wildcard and excludeBranches not working

Hi,

We tried to map some SVN tags into Git branches in an already existing SubGit-managed repository.

We have the following structure, dating all the way back to 2014 but we are only interested in the most recent ones:

<SVN-ROOT>/tags/nightly-build/2015-09/01-0000/*/

See screenshot for illustration:

We first tried to exclude all the old branches we are not interested in by using the following config:

excludeBranches = tags/nightly-build/2014*
excludeBranches = tags/nightly-build/2015*
excludeBranches = tags/nightly-build/2016*
excludeBranches = tags/nightly-build/2017*
excludeBranches = tags/nightly-build/2018*
excludeBranches = tags/nightly-build/2019*
excludeBranches = tags/nightly-build/2020*
excludeBranches = tags/nightly-build/2021*
excludeBranches = tags/nightly-build/2022-0*
excludeBranches = tags/nightly-build/2022-10

which unfortunately did not work. The SubGit install log still showed that it tried to retrieve content from the excluded paths.

We also tried excludeTags, which according to the documentation should do the same, but that did not work either.

As there are over 3000 such tag folders, we don’t want to specify them all explicitly.

After that, we tried defining only the parts we are interested in by using the following mapping. As you can see, we are also interested in the ones that will be created in the future, for as long as we are using SubGit for our migration process.

branches = tags/nightly-build/2022-11/*/product:refs/heads/nightlies/2022-11/*
branches = tags/nightly-build/2022-12/*/product:refs/heads/nightlies/2022-12/*
branches = tags/nightly-build/2023*/*/product:refs/heads/nightlies/2023*/*

This unfortunately does not work either, as the SubGit install log shows a lot of unrelated paths and, weirdly enough, shows them multiple times.

[2022-11-18 14:02:53.244][subgit-install][1] Checking path "tags/nightly-build/2020-11/26-0331/XXX/XXXX/XXXXXXXXXXXXXXXXXXXXXXXX/XXXXXXX" for changes related to the layout
[2022-11-18 14:02:53.244][subgit-install][1] The changes are unrelated
[2022-11-18 14:02:53.244][subgit-install][1] Checking path "tags/nightly-build/2020-11/26-0331/XXXX/XXXX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXX/XXXXXXX" for changes within layout
[2022-11-18 14:02:53.244][subgit-install][1] It is not inside the layout
[2022-11-18 14:02:53.244][subgit-install][1] Checking path "tags/nightly-build/2020-11/26-0331/XXXX/XXXX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXX/XXXXXXX" for changes related to the layout
[2022-11-18 14:02:53.244][subgit-install][1] The changes are unrelated
[2022-11-18 14:02:53.244][subgit-install][1] Checking path "tags/nightly-build/2020-11/26-0331/XXX/XXXXX/XXXXXXXXXXXXXXXXXXXX/XXXXXXXX/XXXXXXXXXXX" for changes within layout
[2022-11-18 14:02:53.244][subgit-install][1] It is not inside the layout
[2022-11-18 14:02:53.244][subgit-install][1] Checking path "tags/nightly-build/2020-11/26-0331/XXX/XXXXX/XXXXXXXXXXXXXXXXXXXX/XXXXXXXX/XXXXXXXXXXX" for changes related to the layout

And this goes on for quite a while, which is not what I would expect, as the path (…2020-11/…) does not match the configured branch mapping.

We have two mappings with a wildcard that work without any issues, so it’s a little perplexing that it does not work in this case as well:

branches = branches/product/*/product:refs/heads/releases/*
branches = branches/features/XXX/*/product:refs/heads/features/*

Additionally, there are the following unrelated log entries, which also show duplicates, which is kind of odd:

[2022-11-18 13:17:00.665][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3843' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:01.604][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3845' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:02.184][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-2039' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:02.941][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-2790' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:03.527][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/testIfIntegrationTestsWork' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:03.749][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3719' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:04.683][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-1895' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:05.260][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-1453' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:05.660][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY2841' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:06.066][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-1297' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:06.819][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3390' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:08.647][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3789' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:09.941][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3421' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:10.524][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-2050' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:12.715][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-2841' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:12.938][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-2323' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:13.520][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-2641' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:14.633][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3108' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:21.402][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3843' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:22.339][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3845' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:22.920][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-2039' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:23.676][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-2790' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:24.254][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/testIfIntegrationTestsWork' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:24.478][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3719' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:25.416][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-1895' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:25.991][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-1453' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:26.392][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY2841' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:26.789][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-1297' path not found: 404 Not Found (DOMAIN)
[2022-11-18 13:17:27.542][subgit-install][1] svn: E160013: '/svn/xxx/branches/features/xxx/YYY-3390' path not found: 404 Not Found (DOMAIN)

  • Are we using this feature incorrectly?
  • Is it expected behavior that, even though paths are excluded, SubGit still somehow needs to process everything?
  • Would it work if we defined explicit exclusions without wildcards?

Thank you in advance

Hello Patrick,
it looks like you understand all the features and options correctly, but something is probably wrong with the URL.

May I ask which URL you use on this page?


I mean, is it like:

http://host/svn/xxx

or e.g. is it

http://host/svn/xxx/trunk

or does it have some other form?

SVN Mirror concatenates the URL with the trunk/branches/tags/shelves option to get the resulting URL.
In your case, the log you’ve provided shows that it concatenates something like

http://host/svn/xxx

with your relative branch path, e.g.

branches/features/xxx/YYY-3845

and obtains the resulting URL

http://host/svn/xxx/branches/features/xxx/YYY-3845

and your SVN server responds with a 404 error code, i.e. the resulting URL is invalid.
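To make the concatenation concrete, here is a minimal Python sketch. It is only an illustration and not SubGit's or SVN Mirror's actual code; the function name `resulting_url` is made up for this example, and the two values are the ones quoted above.

```python
def resulting_url(base_url, relative_path):
    """Join the configured repository URL with a relative branch path.

    Illustration only: a plain string join with exactly one '/' in between.
    """
    return base_url.rstrip("/") + "/" + relative_path.lstrip("/")


# Values taken from the example above.
url = resulting_url("http://host/svn/xxx", "branches/features/xxx/YYY-3845")
print(url)  # http://host/svn/xxx/branches/features/xxx/YYY-3845
```

If the server answers 404 for the printed URL, either the base URL or the relative path does not exist in the repository at the requested revision.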

You know the situation better than I do, so here are some questions for you:

  • is “http://host/svn/xxx/branches/features/xxx/YYY-3845” correct?
  • if not, which URL for that branch would be correct?
  • the correct URL ends with “branches/features/xxx/YYY-3845”, right?
  • if you remove “branches/features/xxx/YYY-3845” from the end of that correct URL, this is the URL you should have specified in the settings. Do you have the same URL in the settings?

Yes, you can define explicit exclusions without wildcards, like

excludeBranches = branches/features/xxx/YYY-3845

Normally all the approaches you describe should work; your intuition is correct.

I think the “404 Not found” issues are due to those branches being deleted in the past, as they are not contained in the repository anymore. So I guess we are fine with this.

There is still the issue with the mapping though. With the following mapping:

branches = tags/nightly-build/2022-11/*/product:refs/heads/nightlies/2022-11/*

The following log messages:

[2022-11-18 13:53:44.126][subgit-install][1] Checking path "tags/not-nigthly-build/18.0.3/beta/2019-05/15-1604/XXX/XXXX/XXXXXXXXXXXXXXXXXXXX/XXXXXXXX/XXXXXXXXXXX" for changes within layout
[2022-11-18 13:53:44.126][subgit-install][1] It is not inside the layout

are seen. It is somehow looking not only within the specified path but also in every child path of “tags”, which in our case is a huge directory full of things we are not really interested in.

Can we configure SubGit to only look into the specified paths and sub-paths? Otherwise this operation takes way too much time.

Where can I find the page you are referring to? In the config file we have http://host/svn/xxx, which is already working for other branches.

The message “It is not inside the layout” is generated while SubGit analyzes an analog of “svn log -v” to check whether the paths changed in the revision being fetched correspond to the trunk/branches/tags/shelves rules or not. It does so to find the changed branches and map them to Git commits. Then it applies the deltas sent by SVN to these commits.

Theoretically, this step shouldn’t take much time. It doesn’t depend on the number of tags in general but on the number of paths changed in that particular revision. But even if many paths are changed, this check is pretty quick as it’s just string pattern matching. As you see from the piece of log you cite, the check lasts less than 1ms.

“tags/not-nigthly-build/…” clearly doesn’t correspond to any rules, and in particular not to the rule above. So it is skipped, which is why you see this message.
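As a rough illustration of why this kind of check is cheap, here is a minimal Python sketch of a per-component pattern match. It is not SubGit's actual implementation: the function name `inside_layout` is made up, and it assumes that each `*` matches within a single path component only. The matching path in the example is hypothetical; the non-matching ones come from the logs above.

```python
from fnmatch import fnmatchcase


def inside_layout(changed_path, svn_pattern):
    """Return True if the leading components of changed_path match svn_pattern.

    Assumption for this sketch: '*' never crosses a '/' boundary, so the
    path and the pattern are compared component by component.
    """
    path_parts = changed_path.strip("/").split("/")
    pattern_parts = svn_pattern.strip("/").split("/")
    if len(path_parts) < len(pattern_parts):
        return False
    return all(fnmatchcase(part, pat) for part, pat in zip(path_parts, pattern_parts))


# SVN side of the rule quoted in this thread.
pattern = "tags/nightly-build/2022-11/*/product"

# Paths from the logs above: neither belongs to the layout.
print(inside_layout("tags/not-nigthly-build/18.0.3/beta/2019-05/15-1604/XXX", pattern))  # False
print(inside_layout("tags/nightly-build/2020-11/26-0331/XXX/XXXX", pattern))  # False

# Hypothetical path that would belong to the layout.
print(inside_layout("tags/nightly-build/2022-11/18-0000/product/main", pattern))  # True
```

Each check is a handful of string comparisons, so even thousands of changed paths in one revision should be matched quickly.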

If all your rules consist only of one ‘branches=’ rule (I assume your URL is “http://host/”):

branches = tags/nightly-build/2022-11/*/product:refs/heads/nightlies/2022-11/*

then you can narrow down the scope by moving the prefix of the rule to the URL, i.e. have the rule:

branches = */product:refs/heads/nightlies/2022-11/*

and the URL: http://host/tags/nightly-build/2022-11

The result will be exactly the same, but anything beyond this narrower URL won’t be considered.

Do you have [many] ‘excludePath=’ rules?
Normally, SubGit is already optimized to get from the server as little as possible, the only room for optimization I see (except the idea above) is when you use ‘excludePath=’ option.

You write “this operation is taking way too much time”. Could you cite the relevant part of the log? I don’t mean the full log (as it could take too much effort to obfuscate) but qualitatively: is it true that you have a block like this one:

[TIMESTAMP1][subgit-install][1] Checking path "tags/xxx/xxx/xxx" for changes within layout
[TIMESTAMP1][subgit-install][1] It is not inside the layout
...
[TIMESTAMP2][subgit-install][1] Checking path "tags/yyy/yyy/yyy" for changes within layout
[TIMESTAMP2][subgit-install][1] It is not inside the layout

and there’s a significant time (minutes, tens of minutes) between TIMESTAMP1 and TIMESTAMP2?
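If scanning the log by eye is tedious, a small script can look for such gaps. The following Python sketch is just one possible way to do it: it assumes every relevant line starts with a timestamp in the `[YYYY-MM-DD HH:MM:SS.mmm]` form seen in the excerpts in this thread, and the function name and the 60-second default threshold are arbitrary choices for the example.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the log excerpts in this thread.
TIMESTAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]")


def find_gaps(lines, threshold_seconds=60.0):
    """Return (gap_in_seconds, line_before, line_after) for every pair of
    consecutive timestamped lines that are at least threshold_seconds apart."""
    gaps = []
    previous = None  # (timestamp, text) of the last timestamped line
    for raw in lines:
        text = raw.rstrip("\n")
        match = TIMESTAMP_RE.match(text)
        if not match:
            continue  # skip lines without a leading timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if previous is not None:
            delta = (current - previous[0]).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((delta, previous[1], text))
        previous = (current, text)
    return gaps


# Usage sketch (the file name is a placeholder):
# with open("subgit-install.log", encoding="utf-8") as log:
#     for seconds, before, after in find_gaps(log):
#         print(f"{seconds:.1f}s between:\n  {before}\n  {after}")
```

The pairs it reports show which log messages surround the slow step, which is the qualitative information asked for above.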

Normally this should be quick because at that stage it doesn’t fetch the revision from SVN but analyzes the revision (the log entry for that revision) to construct a subsequent SVN query that would fetch the revision with the minimum possible amount of data transferred. The real fetch operation starts with something like “SET_PATH” in the log.

In order to test out how long this whole process will take for our repository, I created a new repository in order to not block the devs and used the following configuration:

minimalRevision = 100218
url = http://host/svn/YYY
trunk = trunk/product:refs/heads/main
branches = branches/product/*/product:refs/heads/releases/*
branches = branches/features/XXX/*/product:refs/heads/features/*
branches = tags/nightly-build/*/*/product:refs/heads/nightlies/*/*
excludeBranches = branches/features/XXX/XXX/product
excludeBranches = tags/nightly-build/2014*
excludeBranches = tags/nightly-build/2015*
excludeBranches = tags/nightly-build/2016*
excludeBranches = tags/nightly-build/2017*
excludeBranches = tags/nightly-build/2018*
excludeBranches = tags/nightly-build/2019*
excludeBranches = tags/nightly-build/2020*
excludeBranches = tags/nightly-build/2021*
excludeBranches = tags/nightly-build/2022-0*
excludeBranches = tags/nightly-build/2022-10

I started this yesterday at 12:45 pm and it’s still running today at 5:00 pm.

We started with r100218 and the current SVN revision is r159040, which means we need to go through 58.822 revisions.

Yesterday after ~2h30m there were ~24.000 revisions done, but since then the whole process has been getting slower and slower.

There are commits that are done in milliseconds (e.g. r139529) and commits that take 10+ minutes (e.g. r100218 or r139521, see attached files) just for retrieving the content:

[2022-11-24 08:13:56.355][subgit-install][1] SET_PATH '' 139529 not empty depth=infinity
[2022-11-24 08:13:56.355][subgit-install][1] SET_PATH 'branches' 139529 not empty depth=infinity
[2022-11-24 08:13:56.355][subgit-install][1] SET_PATH 'branches/features' 139529 not empty depth=infinity
[2022-11-24 08:13:56.355][subgit-install][1] SET_PATH 'branches/features/XXX' 139529 not empty depth=infinity
[2022-11-24 08:13:56.355][subgit-install][1] SET_PATH 'branches/features/XXX/XXXXXXXXX' 139529 not empty depth=infinity
[2022-11-24 08:13:56.579][subgit-install][1] fetching: branch = refs/svn/root/branches/features/XXX/XXXXXXXXX/product, revision = 139529, receivedFileCount=652592
[2022-11-24 08:13:56.642][subgit-install][1] fetched: hash = 56bec29f2e6561636fc59affe2537e7a518eb285, branch = refs/svn/root/branches/features/XXX/XXXXXXXXX/product, revision = 139529
[2022-11-24 08:13:56.643][subgit-install][1] Updating latest fetched revision for svn-remote "svn" to r139529
[2022-11-24 08:13:56.648][subgit-install][1] SET_PATH 'branches/features/XXX/XXXXXXXXXX/product' 139529 not empty depth=infinity

r100218.txt (239.7 KB)
r139521.txt (390.1 KB)

Currently we have ~42.000 revisions done, so still ~16.000 to go. That means in ~26 hours (from ~3pm yesterday until 5pm today) it went through ~18.000 revisions, which is kind of slow and, as already mentioned, seems to be getting slower over time.

Is there anything we can do to speed this up?

We are currently at r148396 since yesterday 5pm, which means ~6.000 revisions in ~16 hours. As feared it seems to be getting slower.
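For reference, the slowdown can be put into numbers with a quick back-of-the-envelope calculation. The Python sketch below only uses the rounded figures quoted in the posts above; the remaining-time estimate assumes the most recent rate stays constant, which may well not hold.

```python
# Rounded figures quoted in the posts above: (label, revisions done, hours taken).
phases = [
    ("first ~2h30m", 24_000, 2.5),
    ("following ~26h", 18_000, 26.0),
    ("latest ~16h", 6_000, 16.0),
]

# Throughput per phase in revisions per hour.
rates = [revisions / hours for _, revisions, hours in phases]
for (label, _, _), rate in zip(phases, rates):
    print(f"{label}: ~{rate:.0f} revisions/hour")

# Revisions still to process, from the revision numbers in the posts.
head_revision = 159_040
current_revision = 148_396
remaining = head_revision - current_revision

# Naive estimate: assumes the latest rate does not degrade any further.
eta_hours = remaining / rates[-1]
print(f"remaining: {remaining} revisions, ~{eta_hours:.1f} hours at the latest rate")
```

With these inputs the rate drops from roughly 9600 to roughly 375 revisions per hour, i.e. by more than an order of magnitude.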

Hello Patrick,
thanks for the information, it has helped a lot!

Indeed, the fetch process itself which happens between the first “SET_PATH” and "fetched: hash = " message is sometimes fast.

The messages like
Getting content of "tags/nightly-build/2018-01" directory at revision 100218
correspond to the preparation phase. SubGit is trying so hard to optimize the “fetch” itself that it now spends too much time in the “preparation” phase to construct an optimal query.

Theoretically SubGit could do the opposite: generate the most straightforward query like “dear SVN, I have revision 100217, give me a diff between 100217 and 100218, so that I could apply it to 100217 which I have locally” in a millisecond or nanosecond. At the moment we don’t have any switch/option to do that, but it would be relatively easy for me to add it. I could also add a heuristic to switch between the “too smart” and the “too straightforward” behaviour depending on the number of branches/tags, as the current behaviour performs very well on 99% of repositories; your case is pretty exceptional.

Why do you see so many messages like
Getting content of "tags/nightly-build/2018-01" directory at revision 100218
?
Usually this happens when it is not a branch or tag itself that is changed but one of its parent directories. I.e. tags/nightly-build/*/*/product itself is not changed, but the change is like:

D /tags/nightly-build

or

R /tags/nightly-build/XXX (from /tags/nightly-build/YYY:12345)

or

A /tags/nightly-build/ZZZ (from /tags/nightly-build/YYY:12345)

etc. In the last case it is possible to tell SVN: “dear SVN, I have 12345 locally, send me the diff between 12344 and 12345” (this is the straightforward approach). But such a diff would be huge, as SVN would basically send the whole of /tags/nightly-build/ZZZ. Instead, SubGit generates an individual diff query for every branch in /tags/nightly-build/ZZZ. To do that, SubGit has to list the /tags/nightly-build/ZZZ directory up to some depth to find the actual branches in it and then generate a diff query for every branch (the smart approach). But listing takes too much time in your case, and the subsequent individual diff queries would take a lot of time as well.

Could we investigate a particular revision as an example? E.g. r100218 or the revision SubGit is stuck at at the moment. What I would like to ask you to send us is:

  • the output of svn log -r$REV -v URL for both 1) the revision where it prints “Getting content …” and 2) the next revision (REV+1), because it actually gets content for both the old and the new revision, so there’s a chance that either REV or (REV+1) is being fetched;
  • if SubGit is stuck at some revision, the output of the following command would help:
jstack -l $PID

where PID is SubGit process ID, it could be determined with

jps

command; ‘jps’ is included in every JDK distribution;

  • what’s your SubGit version? (it would be useful to decode ‘jstack’ output).

With that information, I’ll try to find out which optimization could help in your case.
If you can’t show exact branch/tag names, their obfuscated variant would be totally ok, it’s important to understand the picture qualitatively.

As far as I can see, SubGit is not stuck, it just takes a lot of time to process all the revisions.

We are using SubGit version 3.3.13.

This is the output for r100218/100219 and r139521/139522:

     svn log -r100218 -v http://HOST/svn/YYY
------------------------------------------------------------------------
r100218 | xxxxxxxxxxxxxxxxx | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | 1 line
Changed paths:
   A /trunk/product


------------------------------------------------------------------------

     svn log -r100219 -v http://HOST/svn/YYY
------------------------------------------------------------------------
r100219 | xxxxxxxxxxxxxxxxx | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | 1 line
Changed paths:
   A /trunk/product/main


------------------------------------------------------------------------

     svn log -r139521 -v http://HOST/svn/YYY
------------------------------------------------------------------------
r139521 | xxxxxxxxxxxxx | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | 1 line
Changed paths:
   M /trunk/XXX/XXXX/XXXXXXXXXXXXXXXXXXXXXX/XXXX/XXXX/XXX/XXX/XXXXX/XXX/XXXXXXX/XXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXX

XXXX-12937: XXXXXXXX XXXXXXX XXXXXXXXXXX XXXX
------------------------------------------------------------------------

     svn log -r139522 -v http://HOST/svn/YYY
------------------------------------------------------------------------
r139522 | xxxxxxxxx | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | 1 line
Changed paths:
   A /branches/features/XXX/XXXX-3551_Branch2
   A /branches/features/XXX/XXXX-3551_Branch2/product
   A /branches/features/XXX/XXXX-3551_Branch2/product/main
   A /branches/features/XXX/XXXX-3551_Branch2/product/main/product-XXXXXXXXXX (from /trunk/product/main/product-XXXXXXXXXX:139521)

XXXX-3551_Branch2: create feature branch for product-XXXXXXXXXX
------------------------------------------------------------------------

Thanks for the log, now everything is crystal clear. For r100218 you have the rule:

trunk = trunk/product:refs/heads/main

and

------------------------------------------------------------------------
r100218 | xxxxxxxxxxxxxxxxx | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | 1 line
Changed paths:
   A /trunk/product


------------------------------------------------------------------------

At the “preparation” phase SubGit thinks: we have an addition of a branch, and the addition has no copy source. It might seem that such a log entry can only correspond to the addition of an empty directory, but this is not always true. It could also correspond to the addition of a directory as a copy of another big directory, where that other directory is unavailable, due to path-based authorization, to the SVN user on whose behalf SubGit acts. In such cases the copy source (“from /blah/blah/blah:12345”) is not shown.

So SubGit understands that if it asks for a diff r100217:100218, it could potentially be huge. Instead SubGit tries to find any branch/tag that has already been fetched, i.e. any branch or tag at r100217 (then it would ask SVN to send the diff between that branch/tag and “/trunk/product”, hoping the difference would be small). But it does that in a silly way, by getting … all the branches and choosing any of them.

We definitely should change that and that would solve the problem.

For r139522 the situation looks to be the same:

branches = branches/features/XXX/*/product:refs/heads/features/*

rule applies to

A /branches/features/XXX/XXXX-3551_Branch2/product

and again, the branch is added without a copy source. So SubGit tries to find any branch as a potential source to it and then it plans to ask SVN for a diff between that “any” branch and the branch being fetched.

Unfortunately, so far I can’t propose any quick work-around.

I’ll fix the bug and issue a new SubGit build, so that it will work fine in your case.

Thank you very much for your feedback and support. :)

Unfortunately the exclusion did not work properly. After ~75 hours the whole process was done, but there are branches not only for the …2022-11… tags but for tags dating all the way back to 2019. This may also have increased the processing time.

Is there anything wrong with the patterns I have provided?

Thank you in advance

Hello Patrick,
you understand the behaviour of all the patterns correctly; it’s just our fault that SubGit behaves slowly in such a situation. As I wrote, I knew of no work-around for the SubGit version you use.

But now I’ve built version 3.3.17-rc2, could you give it a try? I believe it should significantly speed up everything in your scenario.

To try it, run

subgit install GIT_REPO

with that new 3.3.17-rc2 version.

I hope it helps. If it does not, may I ask you to run the following command and send us the output:

jstack -l <PID>

where PID is the process ID of SubGit? You can find it using the jps command. Both ‘jstack’ and ‘jps’ are part of the standard JDK distribution. The ‘jstack’ command dumps threads, so we will find out where most of the time is spent. It’s better to do that several times, e.g. every few seconds.
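Taking several dumps a few seconds apart could be scripted, for example, with a small Python helper like the one below. This is only a sketch: the function name, the defaults (5 dumps, 5 seconds apart) and the injectable `run`/`sleep` parameters are assumptions made for this example; only the `jstack -l <PID>` command itself comes from the instructions above.

```python
import subprocess
import time


def collect_thread_dumps(pid, count=5, interval_seconds=5.0,
                         run=subprocess.run, sleep=time.sleep):
    """Run 'jstack -l <pid>' count times, pausing between runs,
    and return the captured outputs as a list of strings.

    'run' and 'sleep' can be replaced, e.g. for testing without a JVM.
    """
    dumps = []
    for index in range(count):
        result = run(["jstack", "-l", str(pid)],
                     capture_output=True, text=True, check=True)
        dumps.append(result.stdout)
        if index < count - 1:
            sleep(interval_seconds)  # no pause needed after the last dump
    return dumps


# Usage sketch (12345 is a placeholder for the PID reported by jps):
# for number, dump in enumerate(collect_thread_dumps(12345), start=1):
#     with open(f"jstack-{number}.txt", "w", encoding="utf-8") as out:
#         out.write(dump)
```

Comparing the dumps shows which stack frames keep reappearing, which is usually where the time goes.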

Trying to open the link shows me the following login screen:

Hi Patrick,

Use the “Log in as guest” feature; it should provide access.