<feed xmlns='http://www.w3.org/2005/Atom'>
<title>delta/libgit2.git/src/diff_file.c, branch ethomson/winhttp</title>
<subtitle>github.com: libgit2/libgit2.git
</subtitle>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/'/>
<entry>
<title>diff_generate: avoid excessive stats of .gitattribute files</title>
<updated>2018-01-03T16:28:24+00:00</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2018-01-03T16:07:36+00:00</published>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/commit/?id=d8896bda5c43616f3c755242703fce7c2a97ad67'/>
<id>d8896bda5c43616f3c755242703fce7c2a97ad67</id>
<content type='text'>
When generating a diff between two trees, for each file that is to be
diffed we have to determine whether it shall be treated as text or as
binary files. While git has heuristics to determine which kind of diff
to generate, users can also that default behaviour by setting or
unsetting the 'diff' attribute for specific files.

Because of that, we have to query gitattributes in order to determine
how to diff the current files. Instead of hitting the '.gitattributes'
file every time we need to query an attribute, which can get expensive
especially on networked file systems, we try to cache them instead. This
works perfectly fine for every '.gitattributes' file that is found, but
we hit cache invalidation problems when we determine that an attribuse
file is _not_ existing. We do create an entry in the cache for missing
'.gitattributes' files, but as soon as we hit that file again we
invalidate it and stat it again to see if it has now appeared.

In the case of diffing large trees with each other, this behaviour is
very suboptimal. For each pair of files that is to be diffed, we will
repeatedly query every directory component leading towards their
respective location for an attributes file. This leads to thousands or
even hundreds of thousands of wasted syscalls.

The attributes cache already has a mechanism to help in that scenario in
form of the `git_attr_session`. As long as the same attributes session
is still active, we will not try to re-query the gitmodules files at all
but simply retain our currently cached results. To fix our problem, we
can create a session at the top-most level, which is the initialization
of the `git_diff` structure, and use it in order to look up the correct
diff driver. As the `git_diff` structure is used to generate patches for
multiple files at once, this neatly solves our problem by retaining the
session until patches for all files have been generated.

The fix has been tested with linux.git by calling
`git_diff_tree_to_tree` and `git_diff_to_buf` with v4.10^{tree} and
v4.14^{tree}.

                | time    | .gitattributes stats
    without fix | 33.201s | 844614
    with fix    | 30.327s | 4441

While execution only improved by roughly 10%, the stat(3) syscalls for
.gitattributes files decreased by 99.5%. The benchmarks were quite
simple with best-of-three timings on Linux ext4 systems. One can assume
that for network based file systems the performance gain will be a lot
larger due to a much higher latency.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When generating a diff between two trees, for each file that is to be
diffed we have to determine whether it shall be treated as text or as
binary files. While git has heuristics to determine which kind of diff
to generate, users can also that default behaviour by setting or
unsetting the 'diff' attribute for specific files.

Because of that, we have to query gitattributes in order to determine
how to diff the current files. Instead of hitting the '.gitattributes'
file every time we need to query an attribute, which can get expensive
especially on networked file systems, we try to cache them instead. This
works perfectly fine for every '.gitattributes' file that is found, but
we hit cache invalidation problems when we determine that an attribuse
file is _not_ existing. We do create an entry in the cache for missing
'.gitattributes' files, but as soon as we hit that file again we
invalidate it and stat it again to see if it has now appeared.

In the case of diffing large trees with each other, this behaviour is
very suboptimal. For each pair of files that is to be diffed, we will
repeatedly query every directory component leading towards their
respective location for an attributes file. This leads to thousands or
even hundreds of thousands of wasted syscalls.

The attributes cache already has a mechanism to help in that scenario in
form of the `git_attr_session`. As long as the same attributes session
is still active, we will not try to re-query the gitmodules files at all
but simply retain our currently cached results. To fix our problem, we
can create a session at the top-most level, which is the initialization
of the `git_diff` structure, and use it in order to look up the correct
diff driver. As the `git_diff` structure is used to generate patches for
multiple files at once, this neatly solves our problem by retaining the
session until patches for all files have been generated.

The fix has been tested with linux.git by calling
`git_diff_tree_to_tree` and `git_diff_to_buf` with v4.10^{tree} and
v4.14^{tree}.

                | time    | .gitattributes stats
    without fix | 33.201s | 844614
    with fix    | 30.327s | 4441

While execution only improved by roughly 10%, the stat(3) syscalls for
.gitattributes files decreased by 99.5%. The benchmarks were quite
simple with best-of-three timings on Linux ext4 systems. One can assume
that for network based file systems the performance gain will be a lot
larger due to a much higher latency.
</pre>
</div>
</content>
</entry>
<entry>
<title>diff_file: properly refcount blobs when initializing file contents</title>
<updated>2017-12-15T10:52:13+00:00</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2017-12-15T10:47:01+00:00</published>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/commit/?id=2388a9e2ab0516c2a9146a1c4d15ced3052fef4c'/>
<id>2388a9e2ab0516c2a9146a1c4d15ced3052fef4c</id>
<content type='text'>
When initializing a `git_diff_file_content` from a source whose data is
derived from a blob, we simply assign the blob's pointer to the
resulting struct without incrementing its refcount. Thus, the structure
can only be used as long as the blob is kept alive by the caller.

Fix the issue by using `git_blob_dup` instead of a direct assignment.
This function will increment the refcount of the blob without allocating
new memory, so it does exactly what we want. As
`git_diff_file_content__unload` already frees the blob when
`GIT_DIFF_FLAG__FREE_BLOB` is set, we don't need to add new code
handling the free but only have to set that flag correctly.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When initializing a `git_diff_file_content` from a source whose data is
derived from a blob, we simply assign the blob's pointer to the
resulting struct without incrementing its refcount. Thus, the structure
can only be used as long as the blob is kept alive by the caller.

Fix the issue by using `git_blob_dup` instead of a direct assignment.
This function will increment the refcount of the blob without allocating
new memory, so it does exactly what we want. As
`git_diff_file_content__unload` already frees the blob when
`GIT_DIFF_FLAG__FREE_BLOB` is set, we don't need to add new code
handling the free but only have to set that flag correctly.
</pre>
</div>
</content>
</entry>
<entry>
<title>Make sure to always include "common.h" first</title>
<updated>2017-07-03T08:51:48+00:00</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2017-06-30T11:39:01+00:00</published>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/commit/?id=0c7f49dd4316b332f30b4ea72a657bace41e1245'/>
<id>0c7f49dd4316b332f30b4ea72a657bace41e1245</id>
<content type='text'>
Next to including several files, our "common.h" header also declares
various macros which are then used throughout the project. As such, we
have to make sure to always include this file first in all
implementation files. Otherwise, we might encounter problems or even
silent behavioural differences due to macros or defines not being
defined as they should be. So in fact, our header and implementation
files should make sure to always include "common.h" first.

This commit does so by establishing a common include pattern. Header
files inside of "src" will now always include "common.h" as its first
other file, separated by a newline from all the other includes to make
it stand out as special. There are two cases for the implementation
files. If they do have a matching header file, they will always include
this one first, leading to "common.h" being transitively included as
first file. If they do not have a matching header file, they instead
include "common.h" as first file themselves.

This fixes the outlined problems and will become our standard practice
for header and source files inside of the "src/" from now on.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Next to including several files, our "common.h" header also declares
various macros which are then used throughout the project. As such, we
have to make sure to always include this file first in all
implementation files. Otherwise, we might encounter problems or even
silent behavioural differences due to macros or defines not being
defined as they should be. So in fact, our header and implementation
files should make sure to always include "common.h" first.

This commit does so by establishing a common include pattern. Header
files inside of "src" will now always include "common.h" as its first
other file, separated by a newline from all the other includes to make
it stand out as special. There are two cases for the implementation
files. If they do have a matching header file, they will always include
this one first, leading to "common.h" being transitively included as
first file. If they do not have a matching header file, they instead
include "common.h" as first file themselves.

This fixes the outlined problems and will become our standard practice
for header and source files inside of the "src/" from now on.
</pre>
</div>
</content>
</entry>
<entry>
<title>giterr_set: consistent error messages</title>
<updated>2016-12-29T12:26:03+00:00</updated>
<author>
<name>Edward Thomson</name>
<email>ethomson@github.com</email>
</author>
<published>2016-12-29T12:25:15+00:00</published>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/commit/?id=909d5494368a00809bc42f4780e86f4dd66e4422'/>
<id>909d5494368a00809bc42f4780e86f4dd66e4422</id>
<content type='text'>
Error messages should be sentence fragments, and therefore:

1. Should not begin with a capital letter,
2. Should not conclude with punctuation, and
3. Should not end a sentence and begin a new one
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Error messages should be sentence fragments, and therefore:

1. Should not begin with a capital letter,
2. Should not conclude with punctuation, and
3. Should not end a sentence and begin a new one
</pre>
</div>
</content>
</entry>
<entry>
<title>git_diff_generated: abstract generated diffs</title>
<updated>2016-05-26T18:01:08+00:00</updated>
<author>
<name>Edward Thomson</name>
<email>ethomson@github.com</email>
</author>
<published>2016-04-19T19:12:18+00:00</published>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/commit/?id=9be638ecf0d64ec98b3fd16f2d983a86d1aca131'/>
<id>9be638ecf0d64ec98b3fd16f2d983a86d1aca131</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
<entry>
<title>diff: include oid length in deltas</title>
<updated>2016-05-26T18:01:05+00:00</updated>
<author>
<name>Edward Thomson</name>
<email>ethomson@microsoft.com</email>
</author>
<published>2015-09-22T22:25:03+00:00</published>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/commit/?id=d68cb736776e0f2f9494b49e2da30a9c4b9fc2c7'/>
<id>d68cb736776e0f2f9494b49e2da30a9c4b9fc2c7</id>
<content type='text'>
Now that `git_diff_delta` data can be produced by reading patch
file data, which may have an abbreviated oid, allow consumers to
know that the id is abbreviated.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now that `git_diff_delta` data can be produced by reading patch
file data, which may have an abbreviated oid, allow consumers to
know that the id is abbreviated.
</pre>
</div>
</content>
</entry>
<entry>
<title>diff: on win32, treat fake "symlinks" specially</title>
<updated>2015-11-03T17:06:49+00:00</updated>
<author>
<name>Edward Thomson</name>
<email>ethomson@microsoft.com</email>
</author>
<published>2015-11-03T14:43:18+00:00</published>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/commit/?id=6b0fc6abc159c6f15f49bf5ab40b1152d8c6165f'/>
<id>6b0fc6abc159c6f15f49bf5ab40b1152d8c6165f</id>
<content type='text'>
On platforms that lack `core.symlinks`, we should not go looking for
symbolic links and `p_readlink` their target.  Instead, we should
examine the file's contents.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
On platforms that lack `core.symlinks`, we should not go looking for
symbolic links and `p_readlink` their target.  Instead, we should
examine the file's contents.
</pre>
</div>
</content>
</entry>
<entry>
<title>Rename FALLBACK to UNSPECIFIED</title>
<updated>2015-06-25T10:48:44+00:00</updated>
<author>
<name>Carlos Martín Nieto</name>
<email>cmn@dwim.me</email>
</author>
<published>2015-06-25T10:48:44+00:00</published>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/commit/?id=c2418f461374a618504c261a4c71cdb01bab9f68'/>
<id>c2418f461374a618504c261a4c71cdb01bab9f68</id>
<content type='text'>
Fallback describes the mechanism, while unspecified explains what the
user is thinking.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Fallback describes the mechanism, while unspecified explains what the
user is thinking.
</pre>
</div>
</content>
</entry>
<entry>
<title>submodule: add an ignore option to status</title>
<updated>2015-06-22T15:02:55+00:00</updated>
<author>
<name>Carlos Martín Nieto</name>
<email>cmn@dwim.me</email>
</author>
<published>2015-05-04T15:29:12+00:00</published>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/commit/?id=c6f489c964bc4df29bdacb1ee4afdcdb294f3815'/>
<id>c6f489c964bc4df29bdacb1ee4afdcdb294f3815</id>
<content type='text'>
This lets us specify in the status call which ignore rules we want to
use (optionally falling back to whatever the submodule has in its
configuration).

This removes one of the reasons for having `_set_ignore()` set the value
in-memory. We re-use the `IGNORE_RESET` value for this as it is no
longer relevant but has a similar purpose to `IGNORE_FALLBACK`.

Similarly, we remove `IGNORE_DEFAULT` which does not have use outside of
initializers and move that to fall back to the configuration as well.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This lets us specify in the status call which ignore rules we want to
use (optionally falling back to whatever the submodule has in its
configuration).

This removes one of the reasons for having `_set_ignore()` set the value
in-memory. We re-use the `IGNORE_RESET` value for this as it is no
longer relevant but has a similar purpose to `IGNORE_FALLBACK`.

Similarly, we remove `IGNORE_DEFAULT` which does not have use outside of
initializers and move that to fall back to the configuration as well.
</pre>
</div>
</content>
</entry>
<entry>
<title>submodule: don't let status change an existing instance</title>
<updated>2015-06-22T15:02:55+00:00</updated>
<author>
<name>Carlos Martín Nieto</name>
<email>cmn@dwim.me</email>
</author>
<published>2015-05-04T15:09:21+00:00</published>
<link rel='alternate' type='text/html' href='http://91.123.203.49/cgit/delta/libgit2.git/commit/?id=64bbd47a32e6aaed539bafd109eef624f24fbae7'/>
<id>64bbd47a32e6aaed539bafd109eef624f24fbae7</id>
<content type='text'>
As submodules are becomes more like values, we should not let a status
check to update its properties. Instead of taking a submodule, have
status take a repo and submodule name.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
As submodules are becomes more like values, we should not let a status
check to update its properties. Instead of taking a submodule, have
status take a repo and submodule name.
</pre>
</div>
</content>
</entry>
</feed>
