delta/go-git.git/src/hash, branch dev.inline

hash/crc32: cleanup code and improve tests

2016-08-31T15:17:57+00:00

Major reorganization of the crc32 code:

 - The arch-specific files now implement a well-defined interface
   (documented in crc32.go). They no longer have the responsibility of
   initializing and falling back to a non-accelerated implementation;
   instead, that happens in the higher level code.

 - The non-accelerated algorithms are moved to a separate file with no
   dependencies on other code.

 - The "cutoff" optimization for slicing-by-8 is moved inside the
   algorithm itself (as opposed to every callsite).

Tests are significantly improved:
 - direct tests for the non-accelerated algorithms.
 - "cross-check" tests for arch-specific implementations (all archs).
 - tests for misaligned buffers for both IEEE and Castagnoli.

Fixes #16909.

Change-Id: I9b6dd83b7a57cd615eae901c0a6d61c6b8091c74
Reviewed-on: https://go-review.googlesource.com/27935
Reviewed-by: Keith Randall

hash/crc32: fix nil Castagnoli table problem

2016-08-28T19:01:07+00:00

When SSE is available, we don't need the Table. However, it is
returned as a handle by MakeTable. Fix this to always generate
the table.

Further cleanup is discussed in #16909.

Change-Id: Ic05400d68c6b5d25073ebd962000451746137afc
Reviewed-on: https://go-review.googlesource.com/27934
Reviewed-by: Brad Fitzpatrick 
Run-TryBot: Brad Fitzpatrick 
TryBot-Result: Gobot Gobot

hash/crc32: improve the AMD64 implementation using SSE4.2

2016-08-28T01:39:03+00:00

The algorithm is explained in the comments. The improvement in
throughput is about 1.4x for buffers between 500b-4Kb and 2.5x-2.6x
for larger buffers.

Additionally, we no longer initialize the software tables if SSE4.2 is
available.

Adding a test for the SSE implementation (restricted to amd64 and
amd64p32).

Benchmarks on a Haswell i5-4670 @ 3.4 GHz:

name                           old time/op    new time/op     delta
CastagnoliCrc15B-4               21.9ns ± 1%     22.9ns ± 0%    +4.45%
CastagnoliCrc15BMisaligned-4     22.6ns ± 0%     23.4ns ± 0%    +3.43%
CastagnoliCrc40B-4               23.3ns ± 0%     23.9ns ± 0%    +2.58%
CastagnoliCrc40BMisaligned-4     25.4ns ± 0%     26.1ns ± 0%    +2.86%
CastagnoliCrc512-4               72.6ns ± 0%     52.8ns ± 0%   -27.33%
CastagnoliCrc512Misaligned-4     76.3ns ± 1%     56.3ns ± 0%   -26.18%
CastagnoliCrc1KB-4                128ns ± 1%       89ns ± 0%   -30.04%
CastagnoliCrc1KBMisaligned-4      130ns ± 0%       88ns ± 0%   -32.65%
CastagnoliCrc4KB-4                461ns ± 0%      187ns ± 0%   -59.40%
CastagnoliCrc4KBMisaligned-4      463ns ± 0%      191ns ± 0%   -58.77%
CastagnoliCrc32KB-4              3.58µs ± 0%     1.35µs ± 0%   -62.22%
CastagnoliCrc32KBMisaligned-4    3.58µs ± 0%     1.36µs ± 0%   -61.84%

name                           old speed      new speed       delta
CastagnoliCrc15B-4              684MB/s ± 1%    655MB/s ± 0%    -4.32%
CastagnoliCrc15BMisaligned-4    663MB/s ± 0%    641MB/s ± 0%    -3.32%
CastagnoliCrc40B-4             1.72GB/s ± 0%   1.67GB/s ± 0%    -2.69%
CastagnoliCrc40BMisaligned-4   1.58GB/s ± 0%   1.53GB/s ± 0%    -2.82%
CastagnoliCrc512-4             7.05GB/s ± 0%   9.70GB/s ± 0%   +37.59%
CastagnoliCrc512Misaligned-4   6.71GB/s ± 1%   9.09GB/s ± 0%   +35.43%
CastagnoliCrc1KB-4             7.98GB/s ± 1%  11.46GB/s ± 0%   +43.55%
CastagnoliCrc1KBMisaligned-4   7.86GB/s ± 0%  11.70GB/s ± 0%   +48.75%
CastagnoliCrc4KB-4             8.87GB/s ± 0%  21.80GB/s ± 0%  +145.69%
CastagnoliCrc4KBMisaligned-4   8.83GB/s ± 0%  21.39GB/s ± 0%  +142.25%
CastagnoliCrc32KB-4            9.15GB/s ± 0%  24.22GB/s ± 0%  +164.62%
CastagnoliCrc32KBMisaligned-4  9.16GB/s ± 0%  24.00GB/s ± 0%  +161.94%

Fixes #16107.

Change-Id: Ibe50ea76574674ce0571ef31c31015e0ed66b907
Reviewed-on: https://go-review.googlesource.com/27931
Run-TryBot: Brad Fitzpatrick 
TryBot-Result: Gobot Gobot 
Reviewed-by: Brad Fitzpatrick

Revert "hash/crc32: improve the AMD64 implementation using SSE4.2"

2016-08-27T16:49:02+00:00

This reverts commit 54d7de7dd62bab764125c48fd159bb938da078e1.

It was breaking non-amd64 builds.

Change-Id: I22650e922498eeeba3d4fa08bb4ea40a210c8f97
Reviewed-on: https://go-review.googlesource.com/27925
Reviewed-by: Keith Randall

hash/crc32: improve the AMD64 implementation using SSE4.2

2016-08-27T15:50:28+00:00

The algorithm is explained in the comments. The improvement in
throughput is about 1.4x for buffers between 500b-4Kb and 2.5x-2.6x
for larger buffers.

Additionally, we no longer initialize the software tables if SSE4.2 is
available.

Benchmarks on a Haswell i5-4670 @ 3.4 GHz:

name                           old time/op    new time/op     delta
CastagnoliCrc15B-4               21.9ns ± 1%     22.9ns ± 0%    +4.45%
CastagnoliCrc15BMisaligned-4     22.6ns ± 0%     23.4ns ± 0%    +3.43%
CastagnoliCrc40B-4               23.3ns ± 0%     23.9ns ± 0%    +2.58%
CastagnoliCrc40BMisaligned-4     25.4ns ± 0%     26.1ns ± 0%    +2.86%
CastagnoliCrc512-4               72.6ns ± 0%     52.8ns ± 0%   -27.33%
CastagnoliCrc512Misaligned-4     76.3ns ± 1%     56.3ns ± 0%   -26.18%
CastagnoliCrc1KB-4                128ns ± 1%       89ns ± 0%   -30.04%
CastagnoliCrc1KBMisaligned-4      130ns ± 0%       88ns ± 0%   -32.65%
CastagnoliCrc4KB-4                461ns ± 0%      187ns ± 0%   -59.40%
CastagnoliCrc4KBMisaligned-4      463ns ± 0%      191ns ± 0%   -58.77%
CastagnoliCrc32KB-4              3.58µs ± 0%     1.35µs ± 0%   -62.22%
CastagnoliCrc32KBMisaligned-4    3.58µs ± 0%     1.36µs ± 0%   -61.84%

name                           old speed      new speed       delta
CastagnoliCrc15B-4              684MB/s ± 1%    655MB/s ± 0%    -4.32%
CastagnoliCrc15BMisaligned-4    663MB/s ± 0%    641MB/s ± 0%    -3.32%
CastagnoliCrc40B-4             1.72GB/s ± 0%   1.67GB/s ± 0%    -2.69%
CastagnoliCrc40BMisaligned-4   1.58GB/s ± 0%   1.53GB/s ± 0%    -2.82%
CastagnoliCrc512-4             7.05GB/s ± 0%   9.70GB/s ± 0%   +37.59%
CastagnoliCrc512Misaligned-4   6.71GB/s ± 1%   9.09GB/s ± 0%   +35.43%
CastagnoliCrc1KB-4             7.98GB/s ± 1%  11.46GB/s ± 0%   +43.55%
CastagnoliCrc1KBMisaligned-4   7.86GB/s ± 0%  11.70GB/s ± 0%   +48.75%
CastagnoliCrc4KB-4             8.87GB/s ± 0%  21.80GB/s ± 0%  +145.69%
CastagnoliCrc4KBMisaligned-4   8.83GB/s ± 0%  21.39GB/s ± 0%  +142.25%
CastagnoliCrc32KB-4            9.15GB/s ± 0%  24.22GB/s ± 0%  +164.62%
CastagnoliCrc32KBMisaligned-4  9.16GB/s ± 0%  24.00GB/s ± 0%  +161.94%

Fixes #16107.

Change-Id: I8fa827ec03f708ba27ee71c833f7544ad9dc5bc3
Reviewed-on: https://go-review.googlesource.com/24471
Reviewed-by: Keith Randall

hash/crc32: fix optimized s390x implementation

2016-08-21T02:04:43+00:00

The code wasn't checking to see if the data was still >= 64 bytes
long after aligning it.

Aligning the data is an optimization and we don't actually need
to do it. In fact for smaller sizes it slows things down due to
the overhead of calling the generic function. Therefore for now
I have simply removed the alignment stage. I have also added a
check into the assembly to deliberately trigger a segmentation
fault if the data is too short.

Fixes #16779.

Change-Id: Ic01636d775efc5ec97689f050991cee04ce8fe73
Reviewed-on: https://go-review.googlesource.com/27409
Reviewed-by: Brad Fitzpatrick

hash/crc32: improve the processing of the last bytes in the SSE4.2 code for AMD64

2016-08-17T21:20:50+00:00

This commit improves the processing of the final few bytes in
castagnoliSSE42: instead of processing one byte at a time, we use all
versions of the CRC32 instruction to process 4 bytes, then 2, then 1.
The difference is only noticeable for small "odd" sized buffers.

We do the similar improvement for processing the first few bytes in
the case of unaligned buffer.

Fixing the test which was not actually verifying the results for
misaligned buffers (WriteString was creating an internal copy which
was aligned).

Adding benchmarks for length 15 (aligned and misaligned), results
below.

name                          old time/op    new time/op    delta
CastagnoliCrc15B-4              25.1ns ± 0%    22.1ns ± 1%  -12.14%
CastagnoliCrc15BMisaligned-4    25.2ns ± 0%    22.9ns ± 1%   -9.03%
CastagnoliCrc40B-4              23.1ns ± 0%    23.4ns ± 0%   +1.08%
CastagnoliCrc1KB-4               127ns ± 0%     128ns ± 0%   +1.18%
CastagnoliCrc4KB-4               462ns ± 0%     464ns ± 0%     ~
CastagnoliCrc32KB-4             3.58µs ± 0%    3.60µs ± 0%   +0.58%

name                          old speed      new speed      delta
CastagnoliCrc15B-4             597MB/s ± 0%   679MB/s ± 1%  +13.77%
CastagnoliCrc15BMisaligned-4   596MB/s ± 0%   655MB/s ± 1%   +9.94%
CastagnoliCrc40B-4            1.73GB/s ± 0%  1.71GB/s ± 0%   -1.14%
CastagnoliCrc1KB-4            8.01GB/s ± 0%  7.93GB/s ± 1%   -1.06%
CastagnoliCrc4KB-4            8.86GB/s ± 0%  8.83GB/s ± 0%     ~
CastagnoliCrc32KB-4           9.14GB/s ± 0%  9.09GB/s ± 0%   -0.58%

Change-Id: I499e37af2241d28e3e5d522bbab836c1a718430a
Reviewed-on: https://go-review.googlesource.com/24470
Reviewed-by: Keith Randall

hash/crc64: Use slicing by 8.

2016-05-18T14:38:04+00:00

Similar to crc32 slicing by 8.
This also fixes a Crc64KB benchmark actually using 1024 bytes.

Crc64/ISO64KB-4       147µs ± 0%      37µs ± 0%   -75.05%  (p=0.000 n=18+18)
Crc64/ISO4KB-4       9.19µs ± 0%    2.33µs ± 0%   -74.70%  (p=0.000 n=19+20)
Crc64/ISO1KB-4       2.31µs ± 0%    0.60µs ± 0%   -73.81%  (p=0.000 n=19+15)
Crc64/ECMA64KB-4      147µs ± 0%      37µs ± 0%   -75.05%  (p=0.000 n=20+20)
Crc64/Random64KB-4    147µs ± 0%      41µs ± 0%   -72.17%  (p=0.000 n=20+18)
Crc64/Random16KB-4   36.7µs ± 0%    36.5µs ± 0%    -0.54%  (p=0.000 n=18+19)

name                old speed     new speed      delta
Crc64/ISO64KB-4     446MB/s ± 0%  1788MB/s ± 0%  +300.72%  (p=0.000 n=18+18)
Crc64/ISO4KB-4      446MB/s ± 0%  1761MB/s ± 0%  +295.20%  (p=0.000 n=18+20)
Crc64/ISO1KB-4      444MB/s ± 0%  1694MB/s ± 0%  +281.46%  (p=0.000 n=19+20)
Crc64/ECMA64KB-4    446MB/s ± 0%  1788MB/s ± 0%  +300.77%  (p=0.000 n=20+20)
Crc64/Random64KB-4  446MB/s ± 0%  1603MB/s ± 0%  +259.32%  (p=0.000 n=20+18)
Crc64/Random16KB-4  446MB/s ± 0%   448MB/s ± 0%    +0.54%  (p=0.000 n=18+20)

Change-Id: I1c7621d836c486d6bfc41dbe1ec2ff9ab11aedfc
Reviewed-on: https://go-review.googlesource.com/22222
Run-TryBot: Ilya Tocar 
TryBot-Result: Gobot Gobot 
Reviewed-by: Russ Cox

hash/crc32: use vector instructions on s390x

2016-04-22T18:07:15+00:00

The input buffer is aligned to a doubleword boundary to
improve performance of the vector instructions. The pure
Go implementation is used to align the input data, and is
also used when the vector instructions are not available
or the data length is less than 64 bytes.

Change-Id: Ie259a5f2f1562bcc17961c99e5776c99091d6bed
Reviewed-on: https://go-review.googlesource.com/22201
Reviewed-by: Michael Munday 
Run-TryBot: Michael Munday 
TryBot-Result: Gobot Gobot 
Reviewed-by: Bill O'Farrell 
Reviewed-by: Brad Fitzpatrick

hash/adler32: Unroll loop for extra performance.

2016-04-15T10:17:17+00:00

name         old time/op    new time/op    delta
Adler32KB-4     592ns ± 0%     447ns ± 0%  -24.49%  (p=0.000 n=19+20)

name         old speed      new speed      delta
Adler32KB-4  1.73GB/s ± 0%  2.29GB/s ± 0%  +32.41%  (p=0.000 n=20+20)

Change-Id: I38990aa66ca4452a886200018a57c0bc3af30717
Reviewed-on: https://go-review.googlesource.com/21880
Reviewed-by: Keith Randall 
Run-TryBot: Ilya Tocar 
TryBot-Result: Gobot Gobot