Where can I find SVNKit code for parsing a revision (revs) file, or for creating one?

Hi,

I’m trying to understand the structure of revision files. I have read Subversion’s documentation already, but it does not have thorough information. I have also been looking at existing revision files in a hex editor. I am exploring SVNKit to learn more, since Java is easier for me to read than C. I would like to build code that parses revision files from a repository. I don’t plan to recreate Subversion functionality, but I would like to learn how a revision file is created (meaning the actual bytes written to it) so that I can read it properly.

I have been using a debugger to explore how SVNKit verifies a revision. It is difficult to understand what is happening so far. FSFS.java, FSFile.java, FSRepositoryUtil.java, and SVNFileUtil.java look very interesting since they do file I/O.

I know it can be a complicated process. I have learned a little about the different FSFS formats, since they vary in how they save data to files. I see that FSFS.readDBFormat() reads the DB format file to learn the version of FSFS that is used.

SVNKit can do all Subversion commands, so I assume it creates new revs files, right? Do you know if there is a particular test file at svnkit - Revision 10852: /trunk/svnkit/src/test/java/org/tmatesoft/svn/test which would be most helpful to look at? (It is a long list of test files.)

I am currently looking at the P2L index checksum and L2P index checksum in revs files. I don’t see how these checksums are created. They are 16 bytes, so I guess they are MD5 hashes.

I appreciate any help!

Hello,

I don’t know if you’ve found the description of the FSFS structure and its indexes among the Subversion documents; if not, they would be useful to read. Also, I would like to note that Subversion is planning to switch to FSX instead of FSFS, and I’m not sure if SVNKit will keep up (if someone contributes a patch with FSX support, then maybe yes). So there’s a risk that once you learn how to work with FSFS at the low level, it will become obsolete.

Anyway, basically the rev files consist of a list of so-called “representations”, and the end of each rev file contains an index. The “representations” contain file contents and properties. They can be plain, i.e. contain the data in raw format; or they can be delta representations, i.e. changes compared to some other representations.

In older FSFS versions the representations were addressed as positions inside rev files (in the sense of RandomAccessFile#getFilePointer and RandomAccessFile#seek). This is referred to as physical addressing in the Subversion docs.

In newer versions of FSFS so-called logical addressing was introduced (and it’s now used by default). And to convert from one into another, two indexes were added to the end of each rev file: L2P and P2L indexes.

Also there’re some other features that are optional for FSFS; some of them are supported by SVNKit, some are not. E.g. packed revision properties, or the representation cache:

$ sqlite3 rep-cache.db
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> .tables
rep_cache
sqlite> .schema rep_cache
CREATE TABLE rep_cache (
   hash TEXT NOT NULL PRIMARY KEY,
   revision INTEGER NOT NULL,
   offset INTEGER NOT NULL,
   size INTEGER NOT NULL,
   expanded_size INTEGER NOT NULL
   ) WITHOUT ROWID;
sqlite> select * from rep_cache;
da39a3ee5e6b4b0d3255bfef95601890afd80709|1|3|4|0

By the way, that’s one of the reasons SVNKit uses the SQLJet library: to read and write the representation cache. But without the cache, both SVNKit and Subversion work just as well.
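For illustration, dumping that table with SQLJet looks roughly like this (a sketch from memory of SQLJet’s cursor API, so treat the exact calls as approximate):

import java.io.File;

import org.tmatesoft.sqljet.core.SqlJetException;
import org.tmatesoft.sqljet.core.SqlJetTransactionMode;
import org.tmatesoft.sqljet.core.table.ISqlJetCursor;
import org.tmatesoft.sqljet.core.table.SqlJetDb;

public class RepCacheDump {

    public static void main(String[] args) throws SqlJetException {
        // open db/rep-cache.db read-only and print every rep_cache row
        SqlJetDb db = SqlJetDb.open(new File("db/rep-cache.db"), false);
        try {
            db.beginTransaction(SqlJetTransactionMode.READ_ONLY);
            ISqlJetCursor cursor = db.getTable("rep_cache").open();
            try {
                while (!cursor.eof()) {
                    System.out.println(cursor.getString("hash") + "|"
                            + cursor.getInteger("revision") + "|"
                            + cursor.getInteger("offset") + "|"
                            + cursor.getInteger("size") + "|"
                            + cursor.getInteger("expanded_size"));
                    cursor.next();
                }
            } finally {
                cursor.close();
            }
        } finally {
            db.commit();
            db.close();
        }
    }
}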

Regarding your questions: the P2L and L2P indexes use FNV-1a checksums (see FSFnv1aInterleavedChecksumCalculator), not MD5. You can look at the corresponding classes: FSP2LProtoIndex and FSL2PProtoIndex. “Proto” means “yet-to-be” in the Subversion context; they are created during “svn commit” and then dumped to the end of the real rev files. That’s why there’s a “txn-protorevs” directory (i.e. “future rev files”) in an FSFS repository.

As for the tests, they are too high-level. Only FSFileTest and PackedRevPropsTest do low-level testing. Otherwise we rely on the 3000+ Subversion Python tests (the ‘svnkit-test’ component).

I hope this information helps. The Subversion FSFS documentation together with SVNKit debugging should make things clear. If you have any concrete questions, feel free to ask; otherwise the topic is too wide for me to describe here.

Thank you. That is very helpful.

I have read the first link many times now. I think I’ve read the second link but I will look at it again as I learn more.

I’m wading through the code, and wow, it takes a lot of exploring to see how it actually makes the checksum and writes data to the files. There are many layers and branches here. It is great to know it does FNV-1a hashing. I hadn’t heard of that one before; I was going in the wrong direction with that.

Here are some relevant code excerpts from various classes below.

I see that FSTransactionRoot.writeFinalRevision() creates a new FSFnv1aOutputStream object, and that FSFnv1aOutputStream is given a protoFile param. That protoFile is a CountingOutputStream object. The code in writeFinalRevision() then later calls FSFnv1aOutputStream’s finalizeChecksum() method which calls FSFnv1aInterleavedChecksumCalculator’s finalizeChecksum() method to do the actual calculation.

So what data is actually given to FSFnv1aInterleavedChecksumCalculator to calculate the checksum? I guess that is in FSFS.writeTxnNodeRevision() and there I see recognizable elements of the revision files. So is the L2P checksum calculated on the whole revision file prior to the footer? What about the P2L?

I will keep reading, but any hints are appreciated. I would like to make my own simple FNV-1a hashing program to see if I can get the same result as the L2P and P2L index checksums I see in revision files.
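For reference, here is the plain 32-bit FNV-1a I plan to start from, with the standard published FNV constants. I understand FSFnv1aInterleavedChecksumCalculator implements a modified “interleaved” variant, so this alone may not reproduce the checksums in rev files:

// Plain 32-bit FNV-1a with the standard published constants. Note:
// FSFnv1aInterleavedChecksumCalculator implements a modified "interleaved"
// variant, so this by itself may not reproduce the checksums in rev files.
public final class PlainFnv1a {

    private static final int OFFSET_BASIS = 0x811c9dc5; // standard FNV-1a 32-bit offset basis
    private static final int PRIME = 0x01000193;        // standard FNV 32-bit prime

    public static int hash(byte[] data, int offset, int length) {
        int hash = OFFSET_BASIS;
        for (int i = offset; i < offset + length; i++) {
            hash ^= (data[i] & 0xff); // xor in the next octet
            hash *= PRIME;            // multiplication wraps mod 2^32 in int arithmetic
        }
        return hash;
    }

    public static void main(String[] args) {
        byte[] input = "abc\n".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.printf("plain FNV-1a: %08x%n", hash(input, 0, input.length));
    }
}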

FSTransactionRoot.java

    public FSID writeFinalRevision(...) {

        [...]

        final FSFnv1aOutputStream checksumOutputStream = new FSFnv1aOutputStream(protoFile);

        getOwner().writeTxnNodeRevision(checksumOutputStream, revNode);
        if (representations != null && revNode.getTextRepresentation() != null && revNode.getType() == SVNNodeKind.FILE && 
                revNode.getTextRepresentation().getRevision() == revision) {
            representations.add(revNode.getTextRepresentation());
        }
        revNode.setIsFreshTxnRoot(false);
        getOwner().putTxnRevisionNode(id, revNode);

        if (getOwner().isUseLogAddressing()) {
            final int checksum = checksumOutputStream.finalizeChecksum();
            final FSP2LEntry entry = new FSP2LEntry(myOffset, protoFile.getPosition() - myOffset, FSP2LProtoIndex.ItemType.NODEREV, checksum, SVNRepository.INVALID_REVISION, itemIndex);
            storeP2LIndexEntry(entry);
        }

        return newId;
    }

    public void storeP2LIndexEntry(FSP2LEntry entry) throws SVNException {
        final FSFS fsfs = getOwner();
        final String txnID = getTxnID();

        if (getOwner().isUseLogAddressing()) {
            final FSP2LProtoIndex protoIndex = FSP2LProtoIndex.open(fsfs, txnID, true);
            assert protoIndex != null;
            try {
                protoIndex.writeEntry(entry);
            } finally {
                if (protoIndex != null) {
                    protoIndex.close();
                }
            }
        }
    }

FSFS.java
    public void writeTxnNodeRevision(OutputStream revNodeFile, FSRevisionNode revNode) throws IOException {
        String id = FSRevisionNode.HEADER_ID + ": " + revNode.getId() + "\n";
        revNodeFile.write(id.getBytes("UTF-8"));
        String type = FSRevisionNode.HEADER_TYPE + ": " + revNode.getType() + "\n";
        revNodeFile.write(type.getBytes("UTF-8"));

        [...]

        revNodeFile.write("\n".getBytes("UTF-8"));
    }

FSFnv1aOutputStream.java
    private final OutputStream delegate;
    private final FSFnv1aInterleavedChecksumCalculator checksumCalculator;

    public FSFnv1aOutputStream(OutputStream delegate) {
        this.delegate = delegate;
        this.checksumCalculator = new FSFnv1aInterleavedChecksumCalculator();
    }

    public int finalizeChecksum() {
        return checksumCalculator.finalizeChecksum();
    }

    private void update(byte[] bytes, int offset, int length) {
        checksumCalculator.update(bytes, offset, length);
    }

FSFnv1aInterleavedChecksumCalculator.java
    public int finalizeChecksum() {
        return finalizeChecksum(buffer, 0, buffered);
    }

I may be wrong (I wrote that long ago), but FNV-1a is calculated for the representation. Pay attention to the CountingOutputStream#resetChecksum method: it is called from the top of FSTransactionRoot#writeHashRepresentation and is the logical beginning of writing a representation. The logical end of writing a representation is FSTransactionRoot#storeP2LIndexEntry, and CountingOutputStream#finalizeChecksum is called just above it, because storeP2LIndexEntry() requires a checksum (of the representation).

So the CountingOutputStream instance gets reused by different representations via resetting. But resetting the checksum doesn’t reset CountingOutputStream#myPosition, i.e. the total number of bytes passed through the stream. That’s why it is “counting”.

An L2P entry doesn’t contain any checksum, but a P2L entry does. If you set a breakpoint in the middle of committing via the file:/// protocol, you’ll typically see the following files:

db
├── current
├── format
├── fsfs.conf
├── fs-type
├── min-unpacked-rev
├── rep-cache.db
├── revprops
│   └── 0
│       └── 0
├── revs
│   └── 0
│       └── 0
├── transactions
│   └── 0-0.txn
│       ├── 1f8cc00171b9b5510b7ac0b9f1055c6be6c524fc
│       ├── changes
│       ├── index.l2p
│       ├── index.p2l
│       ├── itemidx
│       ├── next-ids
│       ├── node._0.0
│       ├── node.0.0
│       ├── node.0.0.children
│       └── props
├── txn-current
├── txn-current-lock
├── txn-protorevs
│   ├── 0-0.rev
│   └── 0-0.rev-lock
├── uuid
└── write-lock

The “proto file” is “0-0.rev”. It has the structure:

<list of representations>

Unlike a normal “rev file” it doesn’t have the P2L and L2P yet; they are formed independently in the index.l2p and index.p2l files. Then at the end of the transaction, the “proto file”, P2L, and L2P are concatenated, and a footer is appended (with the information about the P2L and L2P sizes, to distinguish the indexes from the list of representations).

P2L and L2P are important for FSTransactionRoot#allocateItemIndex. From the code you can easily see that if addressing is “physical”, i.e. fsfs.isUseLogAddressing() returns false, the method just returns the offset in the “proto file”. I.e. the representation (to be written) is addressed by its position in that file (and no L2P and P2L are used in the “physical” case).

But for “logical” addressing, first of all an “item index” file is used (FSTransactionItemIndex). It’s the “itemidx” file in the transaction directory. It’s a text file containing a number. By default it’s 3 (FSID.ITEM_FIRST_USER; 0, 1, and 2 are reserved) and it is incremented every time allocateItemIndex() is called (see FSTransactionItemIndex#allocateItemIndex).

E.g.

$ cat transactions/0-0.txn/itemidx
4
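In a simplified sketch (my own illustration of the behavior just described, not the actual FSTransactionItemIndex code):

    // Simplified sketch of the allocation described above (not the actual
    // FSTransactionItemIndex code): the itemidx file holds the next free item
    // index as text; a missing file is equivalent to it containing "3"
    // (FSID.ITEM_FIRST_USER; 0, 1, 2 are reserved).
    static long allocateItemIndex(java.io.File itemIndexFile) throws java.io.IOException {
        long next = 3;
        if (itemIndexFile.isFile()) {
            String text = new String(java.nio.file.Files.readAllBytes(itemIndexFile.toPath()),
                    java.nio.charset.StandardCharsets.US_ASCII);
            next = Long.parseLong(text.trim());
        }
        // store the incremented value for the next allocation, return the current one
        java.nio.file.Files.write(itemIndexFile.toPath(),
                Long.toString(next + 1).getBytes(java.nio.charset.StandardCharsets.US_ASCII));
        return next;
    }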

Once this “itemIndex” is allocated, a pair (itemIndex, position in the proto file) is stored in L2P in little-endian format:

$ hd transactions/0-0.txn/index.l2p
00000000  01 00 00 00 00 00 00 00  03 00 00 00 00 00 00 00  |................|
00000010

Here it’s 3 because at the moment of the allocation the “item index file” didn’t exist (which is equivalent to it containing “3”). So after incrementing we get “4” (for the future), while in the L2P index file we store the allocated item index (3) together with the offset in the file, which was 1 in my case for some reason.

So as you see, the “item index” is either the position of the representation in the “proto file” (and in the “rev file”, because it consists of the “proto file” + indexes + a footer with size info) in “physical” addressing mode, OR it is a proxy for that position via L2P.
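For illustration, building such a 16-byte proto L2P entry looks like this (the field order, offset first and then item index, is inferred from the hex dump above, so treat it as an assumption):

    // Build one proto L2P entry as seen in the index.l2p dump above:
    // two little-endian 64-bit values. The field order (offset, then item
    // index) is inferred from that dump, so treat it as an assumption.
    static byte[] protoL2PEntry(long offsetInProtoFile, long itemIndex) {
        java.nio.ByteBuffer buffer = java.nio.ByteBuffer
                .allocate(16)
                .order(java.nio.ByteOrder.LITTLE_ENDIAN);
        buffer.putLong(offsetInProtoFile);
        buffer.putLong(itemIndex);
        return buffer.array();
    }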

Then writeHashRepresentation() writes the whole representation to the “proto file”, passing it through a CountingOutputStream and calculating the hash of that representation. And finally the P2L entry is written; it is the same as the L2P entry but with some additional information (note: FSP2LEntry#number = item index) like “checksum”, “type”, etc.

When the transaction is over, FSRoot#writeP2LIndex and FSRoot#writeL2PIndex are called (see FSRoot#writeIndexData) to convert these files to the real P2L and L2P blocks of the “rev file” (they are not a simple copy-and-paste), and the footer with the sizes is written.

The purpose of FSRoot#writeP2LIndex and FSRoot#writeL2PIndex seems to be to discourage you from analyzing them, as they do some crazy “offsets and pages” math, the details of which I don’t remember. If you only need to implement read-only operations, it would probably be easier to understand how the P2L and L2P indexes of the “rev file” are used, not how they are formed from index.l2p and index.p2l.

But there are multiple representations within a revision file, is that correct? And there is only one L2P index checksum and one P2L index checksum in a revision file footer.

Docs:

In logical addressing mode, the revision footer has the form
   <l2p offset> <l2p checksum> <p2l offset> <p2l checksum><terminal byte>

I believe logical addressing mode is the default. Subversion 1.9 introduced FSFS format 7, which included the introduction of logical addressing. I also see it says “logical” in the format file located under the db directory.
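For reference, a format 7+ format file names the addressing mode on its own line; on an FSFS format 8 repository it looks something like this (the shard size may differ per repository):

$ cat db/format
8
layout sharded 1000
addressing logical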

Docs:

Two addressing modes are supported in format 7: physical and logical
addressing. Both use the same address format but apply a different
interpretation to it. Older formats only support physical addressing.

I have one specific inactive Subversion repo in mind, which is the focus of my project. I do have another two to look at, however. By the way, thank you for the information that the Subversion team plans to switch from FSFS to FSX, but it does not matter for my purposes. This project is a “one-off” and we likely won’t need it to work with future versions of Subversion.

I checked and I do not have any index.l2p or index.p2l or 0-0.rev files in my repositories. So going by your description, perhaps they are temporary files that exist only prior to concatenation.

Example revision file hex (created with Subversion, FSFS format 8) is in the attached screenshot.

I have highlighted some corresponding areas: green shows the L2P and P2L offsets, blue the L2P index checksum, and purple the P2L index checksum.

The purpose of FSRoot#writeP2LIndex and FSRoot#writeL2PIndex seems to be to discourage you from analyzing them, as they do some crazy “offsets and pages” math, the details of which I don’t remember. If you only need to implement read-only operations, it would probably be easier to understand how the P2L and L2P indexes of the “rev file” are used, not how they are formed from index.l2p and index.p2l.

I may have found that already. If it is the same thing I am thinking of, they appear right after “L2P-INDEX” and right after “P2L-INDEX” in the revision files. By looking at a lot of revision files I was able to determine the math used. It is a crazy calculation.

I will need to do some write operations eventually.

Here is the crazy calculation:
Revision #88215
88215 / 128^2 = 5 (hex 05), remainder2 6295 [3 bytes required]
remainder2 6295 / 128^1 = 49, remainder1 23 … 49 + 128 = 177 (hex B1)
remainder1 23 + 128 = 151 (hex 97)
88215 → 97 B1 05

Revision #16383
16383 / 128^2 = 0 remainder2 16383
remainder2 16383 / 128^1 = 127 (hex 7F) remainder1 127 [2 bytes required]
remainder1 127 + 128 = 255 (hex FF)
16383 → FF 7F

Revision #17
17 / 128^2 = 0 remainder2 17
remainder2 17 / 128^1 = 0 remainder1 17
remainder1 17 (hex 11) [1 byte required]
I guess you don’t add 128 if only 1 byte is required.
This matches the hex editor screenshot - revs file 17.
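In Java, I would code that calculation like this (my own sketch matching the three examples above; per your earlier comment, SVNKit’s FSPackedNumbersStream is much more involved):

    // Encode a non-negative number as in the examples above: little-endian
    // 7-bit groups, with the high bit set on every byte except the last.
    static byte[] encodePackedNumber(long value) {
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        while (value >= 0x80) {
            out.write((int) (value & 0x7f) | 0x80); // low 7 bits, "more bytes follow" flag
            value >>>= 7;
        }
        out.write((int) value); // final byte: high bit clear
        return out.toByteArray();
    }

    static long decodePackedNumber(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7f) << shift; // accumulate 7 bits at a time
            shift += 7;
        }
        return value;
    }

    // encodePackedNumber(88215) -> 97 B1 05
    // encodePackedNumber(16383) -> FF 7F
    // encodePackedNumber(17)    -> 11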

Afaik, yes. And yes, there’s only one L2P checksum and one P2L checksum in the footer.

Yes, this is a description of P2L entries.

Docs:

Phys-to-log index
=================

This index has to map offset -> (rev, item_index, type, len, checksum).

while L2P doesn’t contain a checksum:

Log-to-phys index
=================

This index has to map (rev, item_index) -> offset. 

Regarding logical addressing, you’re right, physical addressing is obsolete. It’s just simpler, so I used it to illustrate the idea, and I didn’t know that you need to work with certain repositories only.

You won’t have them until you create a new revision with SVNKit. In the process of creating a new revision, SVNKit creates temporary files in the “transactions/” and “txn-protorevs/” directories. Once the transaction is committed, these directories are cleaned. So to catch the moment when they contain the files, you should set a breakpoint. Also note that many files (like “itemidx” or the index files) are created on demand (on the first write).
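For example, a minimal commit over file:/// where you can set that breakpoint (a sketch along the lines of the standard SVNKit commit-editor example; the paths are placeholders):

import java.io.ByteArrayInputStream;
import java.io.File;

import org.tmatesoft.svn.core.SVNURL;
import org.tmatesoft.svn.core.internal.io.fs.FSRepositoryFactory;
import org.tmatesoft.svn.core.io.ISVNEditor;
import org.tmatesoft.svn.core.io.SVNRepository;
import org.tmatesoft.svn.core.io.SVNRepositoryFactory;
import org.tmatesoft.svn.core.io.diff.SVNDeltaGenerator;

public class CommitBreakpointDemo {

    public static void main(String[] args) throws Exception {
        FSRepositoryFactory.setup(); // enable the file:/// protocol
        SVNURL url = SVNRepositoryFactory.createLocalRepository(new File("/tmp/repo"), true, false);
        SVNRepository repository = SVNRepositoryFactory.create(url);

        // a breakpoint inside closeEdit() catches the transaction while
        // transactions/ and txn-protorevs/ still contain their files
        ISVNEditor editor = repository.getCommitEditor("add a file", null);
        editor.openRoot(-1);
        editor.addFile("file", null, -1);
        editor.applyTextDelta("file", null);
        String checksum = new SVNDeltaGenerator().sendDelta("file",
                new ByteArrayInputStream("abc\n".getBytes("UTF-8")), editor, true);
        editor.closeFile("file", checksum);
        editor.closeDir();   // close the root directory
        editor.closeEdit();  // commits the transaction
    }
}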

Yes, exactly. It took me a while to understand how they all are calculated. If you got that math quickly, you must be very smart :)

Then you’ll be curious to know about one amazing bug I’ve run into on Linux: FLOCK (BSD lock) vs POSIX lock.

In the process of creating a new revision, SVNKit (before this issue was fixed) took a standard Java lock (FileChannel#lock), which is a POSIX lock. Until some version, Subversion also took a POSIX lock. So if one committed concurrently to the same repository using SVNKit and native Subversion, they waited for each other. But then native Subversion switched to FLOCK (a BSD lock, the default mechanism in the APR library), and these two kinds of locks are transparent to each other, i.e. you can take a FLOCK lock and a POSIX lock on the same file and they won’t wait for each other. But two FLOCK locks, or two POSIX locks, do see each other.

We’ve solved the problem in SVNKit by introducing a double-locking mechanism, SVNDoubleLock: we take both kinds of locks, so now we don’t conflict with older versions of Subversion and SVNKit, nor with their newer versions. Before this bug was fixed, two concurrent threads created one crazy mutant revision instead of two subsequent revisions.

Ok, but I don’t understand. You said an FNV-1a checksum is calculated for a representation, and a revision file can contain many representations. Why is only one FNV-1a checksum stored in the revision file? Which representation is the checksum for? Is it for one, or many representations?

P2L index - This index has to map offset → (rev, item_index, type, len, checksum).
L2P index - This index has to map (rev, item_index) → offset.

Oh, the P2L index itself also contains a checksum? Then the footer contains one checksum for L2P and one checksum for P2L. Wow, this is confusing! But I do see there are more bytes between P2L-INDEX and the footer than there are between L2P-INDEX and P2L-INDEX, so probably a 16-byte checksum fits in there.

I loaded all the SVNKit test files into a test program in my IDE, with JUnit and Hamcrest.

I want to test generating an FNV-1a hash (with SVNKit code) and display the result so I can check it. What I created gives an incorrect result, so I am having difficulty here.

byte[] inputBytes = hexStringToByteArray(inputHexBytes);
FSFnv1aInterleavedChecksumCalculator calculator = new FSFnv1aInterleavedChecksumCalculator();
calculator.update(inputBytes, 0, 0);
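// note: the length argument here is 0, so update() processes no bytes;
// "inputBytes.length" was probably intended, which would explain why the
// printed result is the same regardless of input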
int checkumResultInt = calculator.finalizeChecksum();
System.out.println("checkumResultInt: "+checkumResultInt);
long checksumResultLong = (long)(checkumResultInt & 0x00000000ffffffffL);
System.out.println("checksumResultLong: "+checksumResultLong);
String checksumResultStr = longToHexBytesLittleEndian(checksumResultLong);
System.out.println("Fnv1aHash output: "+checksumResultStr);
Output:

checkumResultInt: -848455035
checksumResultLong: 3446512261
Fnv1aHash output: 859a6dcd00000000

My result is only 8 bytes and it is the same regardless of input (same int after finalizeChecksum()) :/

By the way, I have used my hexStringToByteArray() method many times, and I have tested it by reversing it; it is correct.

It seems strange that it works on an int, since that limits the range. Are there only 4,294,967,296 (2^32) possible hash results? The hash result uses 16 bytes in the revision files. It is likely I don’t understand something.

I also want to test reading an existing file (from revprops), similar to the test class FSFileTest.java. In FSFileTest.java I see that it writes String content to a File, then creates a new FSFile object for it. Then it gets SVNProperties via the FSFile.readProperties() method.

However, in my attempt I receive an “svn: E200002: Malformed file” error (FSFile line 228). It’s strange, because my revprops file looks much like the example in FSFileTest.java. When I debug, I see it fails on the “END” line (kind is not K or D). Should I set allowEOF to false? Then it would break out of the readProperties method.

By the way, when I try running the tests, they all fail with “Unable to load properties resources: /org/tmatesoft/svn/test/test.properties and /org/tmatesoft/svn/test/test.properties.template”. I see that TestOptions references these files.

Can you describe what I need?

Hah, I don’t know. The calculation in FSPackedNumbersStream looks much more complicated! I think it would be more difficult to understand the calculation by reading that code than the way I learned it. But it’s possible my calculation doesn’t account for every possibility.

So many questions! I’ll try to answer as many of them as I can now, and will answer the rest later (as it’s nearly midnight here).

First of all, I must admit that when I was using the L2P and P2L abbreviations I may have been misleading, as there are 2 kinds of L2P and P2L indexes:

  • “proto indexes”, i.e. the index.p2l and index.l2p files that exist until the transaction is finalized; AND
  • L2P-INDEX and P2L-INDEX as the part of the footer of the “rev file”, let’s call them “rev indexes”.

They are not the same, as you can see from the complexity of FSRoot#writeP2LIndex and FSRoot#writeL2PIndex; these methods convert “proto indexes” into “rev indexes”.

Mostly by P2L and L2P I meant “proto indexes”.

As for checksums: FNV-1a, MD5, and SHA-1 are all used in Subversion, but at different moments.

The FNV-1a checksum type is “int”, i.e. it has 4 bytes. It is only used inside the FSP2LEntry class. This FSP2LEntry class corresponds to a P2L “proto index” (i.e. index.p2l) entry. Of course, when the transaction is finished and the P2L “proto index” is converted to the P2L “rev index”, the checksum goes into it as well.

By the way, to read a “rev index”, the FSLogicalAddressingIndex class is used; e.g. FSLogicalAddressingIndex#lookupP2LEntries returns List<FSP2LEntry>.

To write the P2L “proto index”, the FSP2LProtoIndex class is used.

There’s no other place where FNV-1a checksum is used.

The “rev indexes” do have their own checksums; indeed, you’re right on this point. You can see that from FSRoot#writeIndexData:

        final long l2pOffset = protoFile.getPosition();
        final String l2pChecksum = writeL2PIndex(protoFile, newRevision, txnId);
        final long p2lOffset = protoFile.getPosition();
        final String p2lChecksum = writeP2LIndex(protoFile, newRevision, txnId);

These checksums are MD5.

If you look inside FSRoot#writeP2LIndex (the same is true for FSRoot#writeL2PIndex), you’ll see

final SVNChecksumOutputStream checksumOutputStream = new SVNChecksumOutputStream(protoFile, SVNChecksumOutputStream.MD5_ALGORITHM, false);

i.e. the MD5-generating output stream is created at the current position of the “proto file”, and then the P2L “rev index” is written to it, starting with the “P2L-INDEX\n” header (the FSLogicalAddressingIndex.P2L_STREAM_PREFIX constant). So the checksum is just the MD5 of the P2L “rev index”.
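So in principle such a checksum can be re-verified from an existing rev file with plain MessageDigest (a sketch; taking the byte ranges from the footer offsets is my assumption about the exact spans):

    // Sketch: MD5 over a byte range of a rev file, mirroring what
    // SVNChecksumOutputStream computes while an index block is written.
    // E.g. the L2P range would be [l2pOffset, p2lOffset), and the P2L range
    // [p2lOffset, start of the footer), with offsets taken from the footer.
    static String md5OfRange(java.io.File revFile, long offset, long length) throws Exception {
        java.security.MessageDigest md5 = java.security.MessageDigest.getInstance("MD5");
        try (java.io.RandomAccessFile raf = new java.io.RandomAccessFile(revFile, "r")) {
            raf.seek(offset);
            byte[] buffer = new byte[8192];
            long remaining = length;
            while (remaining > 0) {
                int read = raf.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (read < 0) {
                    break; // unexpected EOF
                }
                md5.update(buffer, 0, read);
                remaining -= read;
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }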

Once the checksums are calculated the footer is formed like the following (FSRoot#writeIndexData):

        final StringBuilder footerBuilder = new StringBuilder();
        footerBuilder.append(l2pOffset);
        footerBuilder.append(' ');
        footerBuilder.append(l2pChecksum);
        footerBuilder.append(' ');
        footerBuilder.append(p2lOffset);
        footerBuilder.append(' ');
        footerBuilder.append(p2lChecksum);
        final String footerString = footerBuilder.toString();

and then the footer length is also written.
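Reading the footer back is then straightforward (a sketch assuming exactly the layout just shown, with the footer length stored in the last byte of the file):

    // Sketch: parse the format-7+ rev file footer, assuming the layout above:
    // "<l2pOffset> <l2pChecksum> <p2lOffset> <p2lChecksum>" followed by one
    // trailing byte holding the footer length.
    static String[] readRevFileFooter(java.io.File revFile) throws java.io.IOException {
        try (java.io.RandomAccessFile raf = new java.io.RandomAccessFile(revFile, "r")) {
            raf.seek(raf.length() - 1);
            int footerLength = raf.readUnsignedByte();
            byte[] footer = new byte[footerLength];
            raf.seek(raf.length() - 1 - footerLength);
            raf.readFully(footer);
            // -> [l2pOffset, l2pChecksum, p2lOffset, p2lChecksum]
            return new String(footer, "US-ASCII").split(" ");
        }
    }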

MD5 and SHA-1 are also used in representations: FSRepresentation#myMD5HexDigest and FSRepresentation#mySHA1HexDigest. If a representation represents a file, they are equal to the MD5 and SHA-1 of the content of the file. In contrast, FNV-1a is calculated for the representation encoded in the way it is stored in the “rev file”. E.g. if I have a file with content “abc\n”, its MD5 and SHA-1 are:

$ md5sum file 
0bee89b07a248e27c83fc3d5951213c1  file
$ sha1sum file 
03cfd743661f07975fa2f1220c5194cbaff48451  file

and if I look inside the “rev file”, it starts with

00000000  44 45 4c 54 41 0a 53 56  4e 02 00 00 04 02 05 01  |DELTA.SVN.......|
00000010  84 04 61 62 63 0a 45 4e  44 52 45 50 0a 69 64 3a  |..abc.ENDREP.id:|
00000020  20 30 2d 31 2e 30 2e 72  31 2f 34 0a 74 79 70 65  | 0-1.0.r1/4.type|
00000030  3a 20 66 69 6c 65 0a 63  6f 75 6e 74 3a 20 30 0a  |: file.count: 0.|
00000040  74 65 78 74 3a 20 31 20  33 20 31 36 20 34 20 30  |text: 1 3 16 4 0|
00000050  62 65 65 38 39 62 30 37  61 32 34 38 65 32 37 63  |bee89b07a248e27c|
00000060  38 33 66 63 33 64 35 39  35 31 32 31 33 63 31 20  |83fc3d5951213c1 |
00000070  30 33 63 66 64 37 34 33  36 36 31 66 30 37 39 37  |03cfd743661f0797|
00000080  35 66 61 32 66 31 32 32  30 63 35 31 39 34 63 62  |5fa2f1220c5194cb|
00000090  61 66 66 34 38 34 35 31  20 30 2d 30 2f 5f 32 0a  |aff48451 0-0/_2.|
...

As you see, it contains both the “0bee…” and “03cfd…” hashes as strings. But the FNV-1a is probably calculated over… I’m not sure which part of this file, but some part of this “rev file”, not just “abc”.

Ok, that’s all for today, to be continued.

By the way, you could also be interested in reading about the “layout sharded” feature (it prevents having too many files in the “db/revs” directory); otherwise your code might fail on some large sharded repositories.

I see that actually there are not many questions left uncovered.

Regarding the test.properties file, here’s where the files reside in the sources: svnkit - Revision 10854: /trunk/svnkit/src/test/resources/org/tmatesoft/svn/test

The tests expect the svnkit/src/test/resources directory to get into the classpath, so the test properties files are loaded from there.

I can reproduce “svn: E200002: Malformed file”, and I would set allowEOF to false. If you find the occurrences of FSFile#readProperties, it is run with allowEOF = true in only one place; everywhere else it’s false. I have no idea why this parameter is named in such a counter-intuitive way (maybe it should be named forbidEOF), why it should ever be set to true, and why, at the one place where it is set to true, a nearby call has the same parameter set to false. I would guess there’s something wrong with the SVNKit code, in the sense that this parameter was designed to impose stricter checks (that the file ends exactly where it should, like after “END” in your revprop file), but then these checks were relaxed by setting the parameter to false. In that case it should be called requireEOF.

Anyway, I don’t think I would do anything about it unless there’s a severe problem like “SVNKit can’t read an actual repository”.

So for your purposes, set the parameter to false. When I set it to false on my repository, the error goes away.

All right, let me see if I understand this. The FNV-1a hash is used to make a 4-byte checksum for a P2L “proto index” entry. When the transaction is finished, the P2L “proto index” is converted to a P2L “rev index”. The FNV-1a hash is included in the “proto index”, so it is also carried into the resulting “rev index”.

Then there is an MD5 checksum calculated for the “rev index”.

Looking at FSRoot.writeL2PIndex(), I can see that LOTS of things are fed into the data used in the “message” to be hashed: the revision number, the L2P page size, the page counts size, the page sizes size, then more page counts and page sizes and entry counts in loops. After that, “return checksumOutputStream.getDigest();” produces the MD5 digest. Wow, I did not guess so many things would be included; it is not just the contents of a file.

So now I am clear: I cannot look at the rev files and see the data used to create the hashes there. They are derived from many values from previous work which aren’t in the file. And the hashes I see in the rev files are actually MD5 after all, not FNV-1a.

To learn more, I would probably need to use SVNKit more thoroughly, as you say: have SVNKit access a repository, set a breakpoint in a debugger in the middle of committing via the file:/// protocol, look at the index.l2p and index.p2l files, and examine more of the values that are used in creating the checksums.

I have recently learned that all zeroes is considered by Subversion to be a checksum match. So I overwrote my current test rev file so that both the L2P and P2L checksums are 16 bytes of zeroes. That remedied the “L2P index checksum mismatch in file” error (and the P2L one) in Subversion. So I may not have to completely replicate the checksum generation process for my purposes. That is “to be determined”.

By the way, your description of MD5 and SHA-1 usage in representations was very helpful. Thank you for the example including hex data!

Oh I did not realize I already had those. Ok, I am now able to run the tests! I have some more errors to look at.

Ok that is good to know, thank you. I will set it to false.

Thank you Dmitry. I greatly appreciate your input.

I confirm that everything you write is absolutely correct. Good luck with your project!