Is checksum verified during commit?

iloving · April 27, 2020, 5:07pm

We’ve been running into an intermittent issue where we discover that a commit in our repo is corrupted, which has resulted in me having to rebuild the entire repo to excise the broken commit.

We’ve been racking our brains trying to figure out the cause, including digging through the code (which was a challenge since it’s been years since I had to roll my sleeves up with Java), and so far the only lead we have is the possibility that the data is somehow getting corrupted mid-commit.
I wanted to check whether SVNKit verifies that the checksum sent by the client matches the content sent. If not, is this something that can be added?

dmitry.pavlenko · April 27, 2020, 8:57pm

Hello Ilsa,
could I ask you for more details?

1. Which SVNKit version are you using? I hope you’re using the latest one, right?
2. What do you mean by a “corrupted” repository? How do you discover that it is corrupted? (errors during operations on the repository / no errors but some file is missing or has wrong content / no errors but “svnadmin verify” complains about it / errors while creating such a revision). What command fails? Or is it a GUI client that fails? Or is it that SVNKit itself cannot work with the SVN repository (in this case it’s interesting to check whether SVNKit and native SVN behave differently)?
3. How do you access your SVN repository? There’s a huge difference between direct access using the file:/// protocol and all other protocols. In the former case it’s SVNKit code that works with the SVN repository at the filesystem level, while in the latter case it’s svnserve / mod_dav_svn, so it’s very important to know.
4. How do you use SVNKit? Do you use it as a Java library with your own code, as part of some Java-based SVN client with a GUI, or through the ‘jsvn’ command line utility?
5. The issue happened just once, right? Or several times?

The more details you provide, the easier it would be for me to help you.

SVNKit always verifies checksums of the files sent and received if they are changed or added. It’s simply a part of the SVN protocol. Also, if you are using any protocol other than file:///, the server-side software does the same. So it’s not something that can be added, as it’s always on.

If you give me more details on what exactly is going on, I will probably be able to find out how to solve it.

iloving · April 27, 2020, 9:21pm

1. We’re currently using scm-manager 1.6, which is using v1.9.0 of SVNKit. We’re planning on updating to 2.0 (which uses the latest 1.10) but that largely depends on whether this issue is solved. We don’t want to go through all the effort to upgrade the server only to discover that the issue remains.
2. Corrupted revision. Metadata is ok, but the revision content fails its checksum on verify. The obvious symptoms are that people get 503 errors when they try to update their working copy, and we back-tracked it to a corrupt revision affecting a file path in their working copy.
3/4. We’re using scm-manager ( https://www.scm-manager.org/ ) to manage the repos. I’m not entirely sure what mechanism it uses for the actual repo access. On the client side, we’re using every variation possible, including various SVN GUIs as well as command line tools. The people with corrupted commits have varied greatly, from TortoiseSVN on Windows to SmartSVN on Mac.
5. The issue has now happened repeatedly. Before, it would occur once every few months, but recently it’s happened twice in almost as many weeks, which has raised alarm bells.

The biggest problem is that it appears to be entirely random. I have spent an entire afternoon trying to forcibly reproduce the problem to no avail, using a variety of different client configurations.

In desperation I tried digging through the Java code and it appeared that the code wasn’t verifying the checksum received from the client. As I am completely unfamiliar with the code-base and how the underlying svn protocols actually function, I’m going to assume I was just looking in the wrong place if you say that the checksums are indeed verified.

dmitry.pavlenko · April 27, 2020, 10:00pm

It sounds like https://issues.tmatesoft.com/issue/SVNKIT-743 issue.

Your repositories were created with SVN >= 1.9, right? It’s easy to check: look at the SVN_REPO/db/format file; if so, it contains “addressing logical”. In that case you really could be affected.
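
For anyone who wants to script that check across several repositories, here is a minimal sketch. It only looks for a line reading “addressing logical” in the file it is given; the class and method names are made up for illustration and are not part of SVNKit.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not an SVNKit class.
final class FormatFileCheck {

    // Returns true if any line of the given db/format file is exactly "addressing logical".
    static boolean usesLogicalAddressing(Path formatFile) throws IOException {
        List<String> lines = Files.readAllLines(formatFile, StandardCharsets.UTF_8);
        for (String line : lines) {
            if (line.trim().equals("addressing logical")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is expected to be the path to SVN_REPO/db/format
        Path formatFile = Paths.get(args[0]);
        System.out.println(usesLogicalAddressing(formatFile)
                ? "contains \"addressing logical\""
                : "no \"addressing logical\" line");
    }
}
```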

I looked around on the internet and indeed scm-manager is using SVNKit’s svnkit-dav component. We had a problem in it, and the problem has been fixed recently (>=1.10.1; SVNKit r10767, on the 23rd of April, 2019). This ‘svnkit-dav’ component is a Java analogue of the mod_dav_svn module of Apache, so SVNKit inside scm-manager works with the SVN repository directly.

It doesn’t relate to checksums because checksums are the way to make sure the data were transferred correctly but the problem happened at the level of writing the correctly transferred data to the hard drive.

So I strongly recommend you to make sure you’re using the latest version of SVNKit (>=1.10.1) and maybe even check scm-manager jars to be sure.

By the way, thanks for the detailed and clear description of the problem, that helped a lot.

For now I think the best thing to do would be to upgrade to the latest SVNKit. If the problem is still reproducible after that, it could mean it’s an unknown issue; in that case, contact us again.

iloving · April 28, 2020, 12:55am

The contents of $REPO/db/format of the repo in question is:
6
layout sharded 1000

So I don’t know if this is the same issue. We will however look into updating the software and see if that makes a difference.

dmitry.pavlenko · April 28, 2020, 1:26pm

No “addressing logical” line, so it’s a different issue. In either case I would recommend you to upgrade scm-manager but now I doubt it would solve the issue.

I wonder if you could send us an example of a corrupted repository so we could know what exactly is wrong with the repository?

iloving · April 28, 2020, 2:50pm

Unfortunately the repo contains sensitive information so I can’t make it available to you, but if there’s anything in particular you’re looking for, or want tested, I can see about getting that to you. The corruption is limited to a particular revision, and following the usual steps of dump->omit bad revision->load mitigates the problem.

One thing I didn’t mention, is that when I rebuild the repository I usually do it from the command line out of expediency. The version of svn cli tools I have installed right now is 1.8.19.

iloving · April 28, 2020, 6:56pm

Also, and this may or may not be relevant, the two users in particular who have run into these issues most often have their working copies checked out into a OneDrive folder, which is what originally gave me the idea that maybe there is a race condition and that the files are being modified after the checksum header was sent to the server. This assumes, of course, that the checksum header is sent before the actual file content.

dmitry.pavlenko · April 28, 2020, 7:03pm

The sensitive information is a big limitation for us. Do you have the corrupted repository preserved somewhere or have you deleted it? To start the investigation I need to understand what exactly is bad about your repository. I wonder if you have logs in scm-manager. What I would like to look at is a stack trace of the exception which happens when you receive the 503 error. The stack trace should show exactly which method fails, and at least it would give us some clue about the problem.

To get the log you probably need to set up more detailed logging in scm-manager; I don’t know if this is possible or how to do it. If you have problems with that, there’s another thing to try: set up SVNKit’s own logging system: https://wiki.svnkit.com/Troubleshooting

Here’s the example of logging.properties file: https://svn.svnkit.com/repos/svnkit/shelves/reporter/svnkit-cli/src/main/conf/logging.properties

But that’s only for the case if scm-manager’s own logging doesn’t work.

iloving · April 28, 2020, 7:07pm

I do still have a copy of the bad repo, so I should be able to at least reproduce the process of the broken checkout. Obviously I won’t be able to do much for the process that added the bad revision in the first place.

dmitry.pavlenko · April 28, 2020, 7:27pm

Reproducing the broken checkout process could be useful to see the error message and the stack trace. So I’m waiting for the information from your side.

In the past we had issues with a race condition ( https://issues.tmatesoft.com/issue/SVNKIT-719 ) related to simultaneous access to the repository using SVNKit and native SVN tools ( file:/// protocol or svnserve or mod_dav_svn ). The problem happened because there are 2 locking mechanisms on Linux: BSD locks (aka FLOCK; used by native SVN since some version) and POSIX locks (old native SVN programs and SVNKit), and they don’t interact with each other. We’ve solved the problem by taking both locks in SVNKit. I don’t think this is the issue you have, but other than that I don’t remember any known race condition problem, and even this problem never led to a corrupted repository.

As you can see from FSCommitter#commitTxn method source, the new revision creation is protected by a write lock.

iloving · April 28, 2020, 9:00pm

Ok, here’s what I get:

$ svnadmin verify $REPO_PATH

--snip--
* Verified revision blah blah blah
* Verified revision 17246.
svnadmin: E160004: Filesystem is corrupt
svnadmin: E200014: Checksum mismatch while reading representation:
   expected:  dfa50c38ce39131f9ece6e3380fae8fa
     actual:  8606b7a137ebde76e4d620d51cfa6a13

$ svnadmin verify $REPO_PATH -r 17248:HEAD

--snip--
* Verified revision 17254.
* Verified revision 17255.

$ svn checkout https://$REPO_PATH

A    $PATH_TO_FILE
svn: E175009: The XML response contains invalid XML
svn: E130003: Malformed XML: no element found at line 67253939

The logs I get after performing the above are… not at all what I expected. There are a dozen occurrences of the following, corresponding to various paths:

2020-04-28 16:14:14.672 [qtp758529971-32] DEBUG svnkit.fsfs - svn: E160013: Attempted to open non-existent child node 'scm'
2020-04-28 16:14:14.673 [qtp758529971-32] DEBUG svnkit.fsfs - svn: E160013: File not found: revision 17,255, path '/scm/svn/$REPO_PATH/$VARIOUS_DIR_IN_REPO'
2020-04-28 16:14:14.678 [qtp758529971-32] DEBUG svnkit.fsfs - svn: E160013: Attempted to open non-existent child node 'scm'
org.tmatesoft.svn.core.SVNException: svn: E160013: Attempted to open non-existent child node 'scm'
        at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:70)
        at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:57)
        at org.tmatesoft.svn.core.internal.io.fs.FSRevisionNode.getChildDirNode(FSRevisionNode.java:593)
        at org.tmatesoft.svn.core.internal.io.fs.FSRoot.openPath(FSRoot.java:101)
        at org.tmatesoft.svn.core.internal.io.fs.FSRoot.getRevisionNode(FSRoot.java:58)
        at org.tmatesoft.svn.core.internal.server.dav.DAVServletUtil.getSafeCreatedRevision(DAVServletUtil.java:58)
        at org.tmatesoft.svn.core.internal.server.dav.handlers.DAVUpdateHandler.addVersionURL(DAVUpdateHandler.java:725)
        at org.tmatesoft.svn.core.internal.server.dav.handlers.DAVUpdateHandler.writeAddEntryTag(DAVUpdateHandler.java:648)
        at org.tmatesoft.svn.core.internal.server.dav.handlers.DAVUpdateHandler.addDir(DAVUpdateHandler.java:521)
        at org.tmatesoft.svn.core.internal.io.fs.FSUpdateContext.updateEntry(FSUpdateContext.java:531)
        at org.tmatesoft.svn.core.internal.io.fs.FSUpdateContext.diffDirs(FSUpdateContext.java:413)
        at org.tmatesoft.svn.core.internal.io.fs.FSUpdateContext.drive(FSUpdateContext.java:310)
        at org.tmatesoft.svn.core.internal.io.fs.FSRepository.finishReport(FSRepository.java:560)
        at org.tmatesoft.svn.core.internal.io.fs.FSTranslateReporter.finishReport(FSTranslateReporter.java:47)
        at org.tmatesoft.svn.core.internal.server.dav.handlers.DAVUpdateHandler.execute(DAVUpdateHandler.java:431)
        at org.tmatesoft.svn.core.internal.server.dav.handlers.DAVReportHandler.execute(DAVReportHandler.java:177)
        at org.tmatesoft.svn.core.internal.server.dav.DAVServlet.service(DAVServlet.java:136)
        at sonia.scm.web.SvnDAVServlet.service(SvnDAVServlet.java:127)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
--snip--

I don’t understand why it’s saying file not found. If it’s referring to $REPO/db/rev/17/17255 the file is there, with ownership, permissions, and selinux context the same as all the other revisions. I assume I must be barking up the wrong tree. I’ve placed the full (but redacted) log here: https://pastebin.com/YumUtAAs

dmitry.pavlenko · April 29, 2020, 1:26pm

If the problem is only about checksum you could try the following work-around:

 svnadmin dump /path/to/repo > repo.dump
 svnadmin create /path/for/recovered/repo
 svnadmin load /path/for/recovered/repo < repo.dump

As far as I know, svnadmin recalculates checksums on the fly. But this is just a work-around (not sure if it would work, but it’s worth trying).

As for the absent node: it’s not about a missing file but about a missing block inside the file.

The $REPO/db/rev/17/17255 file corresponds to r17255 and consists of blocks corresponding to nodes, i.e. files and directories inside that revision. So the block inside this file is missing according to the error. The blocks contain properties of the nodes and content of the nodes (full or as a delta to blocks in some previous revisions).
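
The ‘17’ directory in that path is consistent with the “layout sharded 1000” line reported from db/format earlier in the thread. A minimal sketch of the arithmetic, assuming the shard directory is simply the revision number divided by the shard size (integer division) and that the shard has not been packed; the class and method names are hypothetical:

```java
// Hypothetical helper, not an SVNKit class.
final class ShardedRevPath {

    // Relative location of a revision file under a sharded layout:
    // "<revision / shardSize>/<revision>", using integer division.
    static String relativePath(long revision, long shardSize) {
        if (revision < 0 || shardSize <= 0) {
            throw new IllegalArgumentException("revision must be >= 0 and shardSize > 0");
        }
        return (revision / shardSize) + "/" + revision;
    }

    public static void main(String[] args) {
        // Values taken from the thread: r17255 with "layout sharded 1000".
        System.out.println(relativePath(17255, 1000)); // prints 17/17255
        System.out.println(relativePath(17246, 1000)); // prints 17/17246
    }
}
```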

You are checking out r17255, right?

I think it could be useful to look inside ‘17255’ and ‘17246’ files using a hex viewer (e.g. with ‘hexdump’ or ‘hd’ command or any other hex viewer). The files are relatively human-readable.

Probably (correct me if I’m wrong) you have $PATH_TO_FILE at ‘17255’ that is kept as a delta to ‘17246’. It would be interesting to find mentions of $PATH_TO_FILE in both files and to see whether there’s something strange (different) about the files, e.g.

one of the files looks as if it was interrupted in the middle
or one of the files doesn’t contain the “add-file” records responsible for the ‘svn log’ command
or you have some revisions created close in time to each other in the SVN log for the interval [17245:17256], which could prove the race condition.

It’s really hard to tell what could be weird without accessing the repository.

Normally all revision files $REPO/db/rev/17/* should have approximately similar structure, if some of them radically differs, that could be a clue.

Also I wonder if there’s something unusual about your filesystem or OS? I mean something that could cause the file locking mechanism to fail, e.g. a filesystem on NFS or a network drive.

iloving · April 29, 2020, 9:39pm

I am aware of the recovery process to excise a corrupt revision from the repository. The problem is that I’ve had to do that several times now, and everyone using the repo subsequently needs to re-check out that repo, which is a major disruption for all involved.

To answer your question, I get the invalid XML response when I update to any revision >= r17247. r17255 just happens to be the last commit in the repo and since svn checks out HEAD by default, etc etc. I’m doing another checkout all the way to 17255 but specifically omitting the directories affected by r17247 and things are checking out just fine.

The repo in question is being used in a very unconventional manner that involves storing a lot of office documents in different directories. I am working to improve the situation but this is the situation I have inherited.

Right now, my current theory is this. The users are checking out their working copies into a OneDrive folder so that they can both commit changes to SVN as well as take advantage of Microsoft’s collaborative features. They are typically using TortoiseSVN to interact with the svn server. I believe that OneDrive is modifying the files after the commit checksum has been calculated, but before TSVN has finished uploading the data.

If this hypothesis is correct, then it will result in a corrupted commit. All of the usual mechanisms that would catch the bad data would not catch this in transit because the corruption didn’t occur mid-stream… it would have occurred before the data was sent.

The only way to catch this would be for SVNKit on the server side to validate the checksum provided by the client, prior to writing the commit to storage.
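
A minimal sketch of that kind of server-side check, written against plain JDK classes rather than SVNKit. MD5 is assumed here because the checksums in the svnadmin verify output above are 32 hex characters long; the class and method names are hypothetical, and a real server would also be storing the bytes as it reads them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration, not SVNKit code.
final class IncomingChecksumCheck {

    // Reads the whole stream and returns the lowercase hex MD5 of the bytes actually received.
    static String md5HexOf(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (DigestInputStream din = new DigestInputStream(in, md5)) {
            byte[] buffer = new byte[8192];
            while (din.read(buffer) != -1) {
                // Reading through the DigestInputStream updates the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // True only if the checksum declared by the client matches the received content.
    static boolean matches(String declaredChecksum, InputStream received)
            throws IOException, NoSuchAlgorithmException {
        return md5HexOf(received).equalsIgnoreCase(declaredChecksum);
    }

    public static void main(String[] args) throws Exception {
        byte[] atChecksumTime = "report, first save".getBytes(StandardCharsets.UTF_8);
        String declared = md5HexOf(new ByteArrayInputStream(atChecksumTime));

        // Simulates the file changing on disk after the checksum was computed.
        byte[] atUploadTime = "report, second save".getBytes(StandardCharsets.UTF_8);

        System.out.println(matches(declared, new ByteArrayInputStream(atChecksumTime))); // prints true
        System.out.println(matches(declared, new ByteArrayInputStream(atUploadTime)));   // prints false
    }
}
```

With a check like this in place, the second upload would be rejected before anything is written, instead of being stored with a checksum that no longer matches its content.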

One of the scm-manager developers was nice enough to go through the SVNKit code for me and verify that SVNKit does not validate the incoming checksum, so this bizarre scenario appears to be a possibility.

The only thing I can’t explain is why the error is specifically an XML error. Even if the files being committed are invalid, one would assume that only the CDATA (I’m assuming CDATA… I have no idea what the XML actually looks like.) is affected and that the XML structure itself should be fine.

I wish I could give you access to the repo. Perhaps if I was to talk to my superiors, we could maybe do a Zoom session and you could examine the repo directly?

dmitry.pavlenko · April 30, 2020, 7:06pm

The Zoom session would be the best option. If you could send to us some individual files of the repository, it would also be nice.

As for your hypotheses, I think none of them is correct.

First of all, even if OneDrive or something else made working copies inconsistent, this should have no influence on the server-side repository integrity as the server side software (svnkit-dav) doesn’t rely on working copies integrity and even existence. So I think if we solve the problem, you can continue using OneDrive without any issue.

As for SVNKit not checking the checksum, this could be so. Just a note for myself: have a look at DAVPutHandler.java:135 and put the following code there

        final String checksum = ((FSDeltaConsumer) deltaConsumer).getChecksum();
        final String resultChecksum = resource.getResultChecksum();
        //TODO compare checksum with resultChecksum

But I don’t think this is the cause. For this to be the cause TortoiseSVN should be buggy as well (I believe this is not so). And also if the problem were about only checksum, there would be no “file not found” error.

As for the XML error: it’s just a consequence of an error that happened in the middle of XML generation, so once the original problem is gone, the XML problem will disappear.

Regarding “any revision >= r17247”: as I wrote, I think the content of the file that causes the problem is kept in the repository as a delta relative to the content of this file at r17246. So as the base of the delta is incorrect for some reason, all subsequent revisions cannot apply their delta(s) to that base.

iloving · May 4, 2020, 9:24pm

Unfortunately I can’t send you the files in question as they are confidential. Also, it shouldn’t be an issue with the deltas as the files in question are binary blobs and so should be stored whole.

When is a good time for this zoom meeting, and do you have a preferred way for me to get you that zoom link? I don’t want to paste it on a public forum for obvious reasons.

dmitry.pavlenko · May 5, 2020, 9:18am

Drop me a line to pavlenko@tmatesoft.com
Usually I’m available between 12:00 and 20:00 CET

dmitry.pavlenko · May 15, 2020, 5:03pm

I’ve added checksum checking at r10789; the relevant issue in our issue tracker is https://issues.tmatesoft.com/issue/SVNKIT-754

We care so much about binary compatibility that you can build SVNKit JAR from trunk sources using the following command and then replace the SVNKit in scm-manager with the new JAR. Thus you could avoid full scm-manager upgrade and have the most stable SVNKit version in it immediately.

./gradlew svnkit:clean svnkit:build -x svnkit:javadoc -x svnkit:test svnkit-cli:clean svnkit-cli:build svnkit-distribution:clean svnkit-distribution:build

The new JAR with the fix can be found at: “./svnkit/build/libs/” subdirectory.

P.S. thanks for the interview, we’re now using one of your phrases as a slogan :)

dmitry.pavlenko · May 18, 2020, 10:46am

I also tried to reproduce the same scenario (svnkit-dav + PDF file + rename + change the renamed file) but without any issue. So probably for now I’ve done my best about this issue and can’t do more.

iloving · May 19, 2020, 2:33pm

Agreed. Without more information I don’t see how we can do more. But it will be interesting to see if that code change actually catches a bad commit.

TYVM for your help!