Box backend incompatible with Box API changes being deployed

Hi,

We started seeing Box transfers get permanently stuck (zero throughput for many hours) after beign triggered by a single error response on March 15 as the Box server-side change below was being tested.

This would probably have been retried and recovered had it not been for the second issue ( https://forum.rclone.org/t/latest-rclone-box-backend-hangs-when-max-duration-hit/22927 ) that makes it loop with zero throughput until killed.

These are 2 problems. The API compatibility issue in the backend is still seen as of version 1.54.1 and manifests itself intermittently with..

2021/03/20 03:30:27 DEBUG : Couldn't decode error response: invalid character '<' looking for beginning of value^M
2021/03/20 03:30:27 DEBUG : pacer: low level retry 1/1 (error Error "504 Gateway Time-out" (504))^M

Hello Box Custom App Owner,
As previously communicated, we're reaching out to notify you of a change that may affect yourBox application(s). If you've already taken corrective actions for your applications based onprevious communication, please ignore this message.
On May 10th, 2021, as part of our continued infrastructure upgrade, Box's
API responseheaders
will standardize to return in a case insensitive manner, in line with
industry bestpractices
and our API documentation. Applications that are using these headers, such as"location" and "retry-after", will need to verify that their applications are checking for theseheaders in a case-insensitive fashion. Over the past 12 months Box has been upgrading itsnetworking and observability infrastructure. These upgrades aim to improve reliability andavailability of Box's products for our customers. This specific change allows Box to deploy anupdated service proxy, enabling Box to more easily monitor service traffic, quickly find problemareas, and tune application performance.
You have been identified as an application owner (developer, support, or admin) that may beimpacted by this change.

How to verify whether your application(s) will be impacted
Verifying whether your application(s) will be impacted will require a review of your code. If youare solely using one of the
official Box SDKs
to connect to Box APIs, you will not be impactedas the SDKs handle response headers in a case-insensitive manner.
You may verify whether you are impacted with the following steps:
Locate the code in your application which handles the responses from Box API requests. If youare not extracting the
response headers
from these responses, you are not impacted.
If you are extracting those response headers, you may be impacted if you are expecting thoseheaders in a case-sensitive fashion.
For instance, if you are expecting the "Location" header to be returned with a starting capital"L", a direct string comparison with a specific case, then your code will need to be altered toprevent it from breaking after the change is made.
How to make the change
To update the impacted application(s), ensure that you are checking these response headers ina case-insensitive manner. Specifically, your code should be built in a defensive way to ensurethat if a header, such as "Location", is sent with an initial capital "L" or a lowercase "l", it'll behandled in the same way. For example, forcing all headers to lowercase before checking, orusing a case-insensitive regex string check, are two viable methods for handling this process.
Additional dates to be aware of
In addition to the date that this change will take effect, May 10th, 2021, we will be running twotests to ensure that we can identify and support applications that may be impacted. Duringthese tests, if your application is impacted, you may see an increase in error rates whenchecking for those headers. The current dates of these tests, which may shift, are:
March 15th, 2021
April 22nd, 2021
If you see an increase in failure rates during these periods, your application(s) will be impactedon May 10th, 2021 and will need to be adjusted using the above steps.

Rclone uses the go standard library handling for HTTP headers which already normalises the case so I didn't think this required any changes from rclone.

However I audited the box backend code and I discovered a place where we weren't using the correct accessor method which could explain your problems if they were caused by the token expiring during an upload.

Can you try this to see if it fixes the problem? If it doesn't then a log with -vv --dump headers would be really helpful

v1.55.0-beta.5331.20c3751ca.fix-box-headers on branch fix-box-headers (uploaded in 15-30 mins)

Hi Nick,

Sure, I'll give that a try soon, but we may not see the real outcome until Rclone does their next test around April 22. I still don't know which of these issues contributes to entering the "stuck" state.

I tracked down what seems to be triggering one of the errors that SEEMED to correlate with the stuck state during the trial around March 15, but I don't know if it is just a symptom or a cause. It may well be a minor issue just seen at the same time rather than the cause.

2021/03/22 10:17:08 DEBUG : Couldn't decode error response: invalid character '<' looking for beginning of value
cases...

2021/03/22 10:17:08 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/03/22 10:17:08 DEBUG : Couldn't decode error response: invalid character '<' looking for beginning of value
2021/03/22 10:17:08 DEBUG : pacer: low level retry 1/10 (error Error "502 Bad Gateway" (502))
2021/03/22 10:17:08 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2021/03/22 10:17:08 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2021/03/22 10:17:08 DEBUG : HTTP REQUEST (req 0xc008fda700)
2021/03/22 10:17:08 DEBUG : GET /2.0/folders/133959134503/items?fields=type%2Cid%2Csequence_id%2Cetag%2Csha1%2Cname%2Csize%2Ccreated_at%2Cmodified_at%2Ccontent_created_at%2Ccontent_modified_at%2Citem_status%2Cshared_link&limit=1000&offset=0 HTTP/1.1^M

And this seems to be a result of missing the Content-Type: text/html and attempting to process the HTML error response with a JSON parser...

2021/03/22 10:17:07 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2021/03/22 10:17:08 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/03/22 10:17:08 DEBUG : HTTP RESPONSE (req 0xc008994c00)
2021/03/22 10:17:08 DEBUG : HTTP/1.1 502 Bad Gateway^M
Transfer-Encoding: chunked^M
Box-Request-Id: 0f943c55fb470e859d0051664b22cbee3^M
Connection: keep-alive^M
Content-Type: text/html^M
Date: Mon, 22 Mar 2021 17:17:08 GMT^M
Etag: "6054f143-ed6c"^M
Strict-Transport-Security: max-age=31536000^M
^M
ed6c^M
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<!--[if lte IE 8]>
<![CDATA[
Content-Type: multipart/related; boundary="_"

--_
Content-Location:bg
Content-Transfer-Encoding:base64

R0lGODlhAQDOAtUAAP7///3+//n8/vf7/vz+//T5/f////j7/uz1+/X6/e72/Onz++Du+ev0+/b6/efy+vD3/OXx+vr8/vv9/vv9/+Pw+eTw+e/2/Pr9/vP4/fL4/OHv+fH4/Pz9//P5/er0++31+9/u+PH3/Ojy+t/t+Oby+urz+97t+P7+/+Lv+e32/PL4/ejz+uTx+uLw+e31/O/3/PX5/d/u+

OK, I'm going to merge that fix anyway so it gets in the v1.55 release

That is caused by the box gateway timing out and emitting HTML instead of the JSON rclone is expecting.

Do you have a complete log for the transfer? I could examine it and see what the root cause?

If you want to share a link privately then private message me here or email me on nick@craig-wood.com

Hi Nick,

I'm still not sure if this is related at all to the more serious problem. For that, I am still working with our team to push on Box's support people to get access to a test instance or at least get an exact time for their second trial in April before they push out the change persistently in May. Their position is that they don't support Rclone and they seem blissfully unaware that NONE of their supported API clients are remotely adequate.

For the impending Box server change, the most troublesome issue is that rclone can get stuck in the "no progress" state forever. For that, I am hoping that the similar looking no-progress state when using the max-duration will shed light on it.

So, back to the non-JSON responses. I suspect that Box occasionally throws an error due to capacity issues on their end. When they do that, they switch from a content type of text/json to text/plain or text/html. At that point, rclone still seems to attempt to parse the response as JSON and throws an error. The only reason I suspect this of being anything but a cosmetic issue is that it was associated with the persistent no-progress state when Box did their trial upgrade on approximately March 15.

I'll send you a link to the full logs privately.

2021/03/22 20:04:47 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/03/22 20:04:47 DEBUG : bcm94910GW_CPE_WL20D1D1_nand_cferom_fs_image_256_pureubi.w: Uploading part 1/8 offset 0/61.500M part size 8M
2021/03/22 20:04:47 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2021/03/22 20:04:47 DEBUG : HTTP REQUEST (req 0xc000cfcd00)
2021/03/22 20:04:47 DEBUG : PUT /api/2.0/files/upload_sessions/63006AC5B4302682931A9C7BEB48A263 HTTP/1.1^M
Host: upload.box.com^M
User-Agent: rclone/v1.54.1^M
Content-Length: 8388608^M
Authorization: XXXX
Content-Range: bytes 0-8388607/64487444^M
Content-Type: application/octet-stream^M
Digest: sha=21yVwTn7zjQYTA4th1h0qxdrXrg=^M
Accept-Encoding: gzip^M
^M

<snip>


2021/03/22 20:04:48 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2021/03/22 20:04:48 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/03/22 20:04:48 DEBUG : HTTP RESPONSE (req 0xc000cfcd00)
2021/03/22 20:04:48 DEBUG : HTTP/1.1 503 Service Unavailable^M
Content-Length: 84^M
Connection: keep-alive^M
Content-Type: text/plain; charset=UTF-8^M
Date: Tue, 23 Mar 2021 03:04:48 GMT^M
Strict-Transport-Security: max-age=31536000^M
^M
upstream connect error or disconnect/reset before headers. reset reason: local reset
2021/03/22 20:04:48 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/03/22 20:04:48 DEBUG : Couldn't decode error response: invalid character 'u' looking for beginning of value
2021/03/22 20:04:48 DEBUG : pacer: low level retry 1/10 (error Error "503 Service Unavailable" (503))
2021/03/22 20:04:48 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2021/03/22 20:04:48 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/03/22 20:04:48 DEBUG : HTTP RESPONSE (req 0xc005eeb300)
2021/03/22 20:04:48 DEBUG : HTTP/1.1 200 OK^M
Transfer-Encoding: chunked^M
Box-Request-Id: 0a96a938a833899b01edbbe0401e41f6f^M

There is a fix for max-duration getting stuck indefinitely in the latest beta (see Fix excessive retries missing --max-duration timeout - fixes #4504 · rclone/rclone@4013bc4 · GitHub)

I think the change I fixed above would have caused indefinite retries when the token expired. Not sure why it caused 504 errors but maybe that is the ultimate fallback from box or maybe capacity errors like you said.

Thanks - that should help

Hi Nick,

I added a log file using the newer beta (Not the one from the new branch yet) and captured the headers and the stuck state. Look for the "duration.txt" file in the same Box folder I shared with you.

-Joel

2021/03/24 11:25:07 DEBUG : Using config file from "/home/joel/.config/rclone/rclone.conf"
2021/03/24 11:25:07 DEBUG : rclone: Version "v1.55.0-beta.5318.2bdbf00fa" starting with parameters ["rclone-v1.55.0-beta.5318.2bdbf00fa-linux-amd64/rclone" "copy" "-vvvv" "--dump" "headers,responses" "--max-duration" "3m" "customer_images/" "Box1:test1"]

Thanks for the logs. Will investigate the big log tomorrow.

The duration log shows a definite problem! Can you wait until it appears to be stuck (so after it gets the last valid HTTP response) then kill -QUIT the process and post the backtrace? That will show me where rclone is locked up - thanks.

Hi Nick,

Tried it again and it didn't fail (ended correctly after deadline), then repeated and caught the failure (duration3.txt in the same Box share)


SIGQUIT: quit
PC=0x473fa1 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x2d23cf0, 0x80, 0x0, 0x0, 0x0, 0x0, 0x7ffc1d6f9508, 0x1018, 0x7ffc1d6f9518, 0x40c3ff, ...)
	runtime/sys_linux_amd64.s:579 +0x21
runtime.futexsleep(0x2d23cf0, 0x0, 0xffffffffffffffff)
	runtime/os_linux.go:44 +0x46
runtime.notesleep(0x2d23cf0)
	runtime/lock_futex.go:159 +0x9f
runtime.mPark()
	runtime/proc.go:1340 +0x39
runtime.stopm()
	runtime/proc.go:2257 +0x92
runtime.findrunnable(0xc00005b000, 0x0)
	runtime/proc.go:2916 +0x72e
runtime.schedule()
	runtime/proc.go:3125 +0x2d7
runtime.park_m(0xc000000c00)
	runtime/proc.go:3274 +0x9d
runtime.mcall(0x0)
	runtime/asm_amd64.s:327 +0x5b

goroutine 1 [semacquire, 5 minutes]:
sync.runtime_Semacquire(0xc000678508)
	runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc000678500)
	sync/waitgroup.go:130 +0x65
github.com/rclone/rclone/fs/sync.(*syncCopyMove).stopTransfers(0xc000678400)
	github.com/rclone/rclone/fs/sync/sync.go:430 +0x93
github.com/rclone/rclone/fs/sync.(*syncCopyMove).run(0xc000678400, 0xc0000400b0, 0x214bfb8)
	github.com/rclone/rclone/fs/sync/sync.go:865 +0x3b0
github.com/rclone/rclone/fs/sync.runSyncCopyMove(0x21375e8, 0xc0000400b0, 0x214bfb8, 0xc0003eaea0, 0x214cf08, 0xc0003eacf0, 0x0, 0x1bb2980, 0xc000720090)
	github.com/rclone/rclone/fs/sync/sync.go:1078 +0x299
github.com/rclone/rclone/fs/sync.CopyDir(...)
	github.com/rclone/rclone/fs/sync/sync.go:1089
github.com/rclone/rclone/cmd/copy.glob..func1.1(0xc000720090, 0xc0000400b0)
	github.com/rclone/rclone/cmd/copy/copy.go:85 +0x85
github.com/rclone/rclone/cmd.Run(0xc000420101, 0x2d09080, 0xc000711d30)
	github.com/rclone/rclone/cmd/cmd.go:251 +0xed
github.com/rclone/rclone/cmd/copy.glob..func1(0x2d09080, 0xc00042c150, 0x2, 0x7)
	github.com/rclone/rclone/cmd/copy/copy.go:83 +0x12a
github.com/spf13/cobra.(*Command).execute(0x2d09080, 0xc00042c0e0, 0x7, 0x7, 0x2d09080, 0xc00042c0e0)
	github.com/spf13/cobra@v1.1.1/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x2d04ee0, 0x0, 0x0, 0xc00006e778)
	github.com/spf13/cobra@v1.1.1/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.1.1/command.go:895
github.com/rclone/rclone/cmd.Main()
	github.com/rclone/rclone/cmd/cmd.go:550 +0xad
main.main()
	github.com/rclone/rclone/rclone.go:14 +0x25

goroutine 19 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0001e5300)
	go.opencensus.io@v0.22.6/stats/view/worker.go:276 +0xcd
created by go.opencensus.io/stats/view.init.0
	go.opencensus.io@v0.22.6/stats/view/worker.go:34 +0x68

goroutine 6 [chan receive, 5 minutes]:
github.com/rclone/rclone/lib/oauthutil.(*Renew).renewOnExpiry(0xc000259500)
	github.com/rclone/rclone/lib/oauthutil/renew.go:39 +0x59
created by github.com/rclone/rclone/lib/oauthutil.NewRenew
	github.com/rclone/rclone/lib/oauthutil/renew.go:28 +0x85

goroutine 2786 [chan receive, 2 minutes]:
github.com/rclone/rclone/backend/local.(*fadvise).worker(0xc000afe980)
	github.com/rclone/rclone/backend/local/fadvise_unix.go:114 +0x5a
created by github.com/rclone/rclone/backend/local.newFadvise
	github.com/rclone/rclone/backend/local/fadvise_unix.go:74 +0xff

goroutine 72 [select]:
github.com/rclone/rclone/cmd.StartStats.func2(0xc000726190, 0xc0001f62a0)
	github.com/rclone/rclone/cmd/cmd.go:360 +0x10d
created by github.com/rclone/rclone/cmd.StartStats
	github.com/rclone/rclone/cmd/cmd.go:356 +0xb1

goroutine 84 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0xc00042cfb0, 0xc004bed000, 0x1)
	runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00042cfac)
	sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	sync/mutex.go:81
github.com/rclone/rclone/fs/asyncreader.(*AsyncReader).Abandon(0xc00042cf50)
	github.com/rclone/rclone/fs/asyncreader/asyncreader.go:302 +0x1bf
github.com/rclone/rclone/fs/asyncreader.(*AsyncReader).Close(0xc00042cf50, 0xc0001fcd90, 0x2107840)
	github.com/rclone/rclone/fs/asyncreader/asyncreader.go:317 +0x2b
github.com/rclone/rclone/fs/accounting.(*Account).Close(0xc004bb3100, 0x0, 0x0)
	github.com/rclone/rclone/fs/accounting/accounting.go:411 +0x8f
github.com/rclone/rclone/fs/operations.Copy(0x2137620, 0xc000712300, 0x214bfb8, 0xc0003eaea0, 0x0, 0xc0001fcd90, 0xc000866280, 0x48, 0x214cf78, 0xc000863800, ...)
	github.com/rclone/rclone/fs/operations/operations.go:474 +0x18ca
github.com/rclone/rclone/fs/sync.(*syncCopyMove).pairCopyOrMove(0xc000678400, 0x2137620, 0xc000712300, 0xc000a780a0, 0x214bfb8, 0xc0003eaea0, 0x4b, 0xc000678500)
	github.com/rclone/rclone/fs/sync/sync.go:395 +0x232
created by github.com/rclone/rclone/fs/sync.(*syncCopyMove).startTransfers
	github.com/rclone/rclone/fs/sync/sync.go:422 +0xcd

goroutine 590 [syscall, 4 minutes]:
os/signal.signal_recv(0x1ea8548)
	runtime/sigqueue.go:168 +0xa5
os/signal.loop()
	os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
	os/signal/signal.go:151 +0x45

goroutine 538 [chan receive, 4 minutes]:
github.com/rclone/rclone/lib/atexit.Register.func1.1()
	github.com/rclone/rclone/lib/atexit/atexit.go:41 +0x45
created by github.com/rclone/rclone/lib/atexit.Register.func1
	github.com/rclone/rclone/lib/atexit/atexit.go:40 +0x94

goroutine 2787 [select]:
github.com/rclone/rclone/fs/accounting.(*Account).averageLoop(0xc004bb3100)
	github.com/rclone/rclone/fs/accounting/accounting.go:212 +0x109
created by github.com/rclone/rclone/fs/accounting.newAccountSizeName
	github.com/rclone/rclone/fs/accounting/accounting.go:110 +0x449

goroutine 2762 [select, 2 minutes]:
io.(*pipe).Write(0xc000abbb00, 0xc005130000, 0xd8000, 0xd8000, 0x28000, 0x0, 0x0)
	io/pipe.go:94 +0x1c5
io.(*PipeWriter).Write(0xc000a1e180, 0xc005108000, 0x100000, 0x100000, 0x0, 0x4000000000000000, 0xc004bb31a0)
	io/pipe.go:163 +0x4c
mime/multipart.(*part).Write(0xc00007f760, 0xc005108000, 0x100000, 0x100000, 0xc0000d9d20, 0x100001, 0x100000)
	mime/multipart/writer.go:196 +0x62
github.com/rclone/rclone/fs/accounting.(*accountWriteTo).Write(0xc000720d68, 0xc005108000, 0x100000, 0x100000, 0x100000, 0x0, 0x0)
	github.com/rclone/rclone/fs/accounting/accounting.go:365 +0xa7
github.com/rclone/rclone/fs/asyncreader.(*AsyncReader).WriteTo(0xc00042cf50, 0x2105ba0, 0xc000720d68, 0xdff000, 0x0, 0x0)
	github.com/rclone/rclone/fs/asyncreader/asyncreader.go:188 +0x148
github.com/rclone/rclone/fs/accounting.(*Account).WriteTo(0xc004bb3100, 0x2107300, 0xc00007f760, 0x7efe362cc180, 0xc004bb3100, 0x1)
	github.com/rclone/rclone/fs/accounting/accounting.go:381 +0x116
io.copyBuffer(0x2107300, 0xc00007f760, 0x2105b60, 0xc004bb3100, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc004e98450)
	io/io.go:405 +0x3ac
io.Copy(...)
	io/io.go:382
github.com/rclone/rclone/lib/rest.MultipartUpload.func1(0xc004e99f20, 0xc004e99f50, 0xc000a1e180, 0x2105b60, 0xc004bb3100, 0x1dcfc2a, 0x8, 0xc000866280, 0x48)
	github.com/rclone/rclone/lib/rest/rest.go:367 +0x2f9
created by github.com/rclone/rclone/lib/rest.MultipartUpload
	github.com/rclone/rclone/lib/rest/rest.go:347 +0x54e

rax    0xca
rbx    0x2d23ba0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x2d23cf0
rsi    0x80
rbp    0x7ffc1d6f94e0
rsp    0x7ffc1d6f9498
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x7ffc1d6f9550
r13    0x2bd
r14    0x80c00b400000
r15    0x0
rip    0x473fa1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
Script done, file is duration3.txt

Thank you for that - very useful.

Can you try your test with this? I think it should fix the problem, but if it doesn't can you kill with SIGINT as before and send the trace - thanks!

v1.55.0-beta.5342.7cdfc65c4.fix-multipart-context on branch fix-multipart-context (uploaded in 15-30 mins)

That one terminated properly. I'll try a few more runs to be sure.

One thing I did notice is it returned a $? status of zero to bash. I'm not sure what is most appropriate when a transfer completes up to the point it is told to abort.

After several more runs, seems to work reliably.

I am still a bit concerned that the pending change on Box's end, requiring case-insensitive parsing of response headers like Retry-After (or is it rETRy-AFTer ??), may still have some hidden impact that got us into this state in the first place in normal situations where the deadline isn't an issue but some other error that should be recoverable triggers a bigger error.

According to my IT team, Box says the test in April will be run on 22nd April, 9AM PST to 23rd April, 9AM PST.

I suspect that there is some additional, possibly recoverable, error that will occur occasionally with the new version. I'll be sure to do some heavy transfers at the beginning of that window.

Thanks for testing. I've merged that fix to master now which means it will be in the latest beta in 15-30 mins and released in v1.55.

Let me know if you have any more problems.