Dgraph bulk crashes on bufio.(*Reader)

Dgraph bulk has been crashing on us during the map phase for a large number of files (1.6T of files, split into ~650M files). The log is:

MAP 04h33m16s nquad_count:2.747G err_count:0.000 nquad_speed:167.6k/sec edge_count:5.495G edge_speed:335.1k/sec
MAP 04h33m17s nquad_count:2.747G err_count:0.000 nquad_speed:167.6k/sec edge_count:5.495G edge_speed:335.1k/sec
MAP 04h33m18s nquad_count:2.748G err_count:0.000 nquad_speed:167.6k/sec edge_count:5.495G edge_speed:335.1k/sec
MAP 04h33m19s nquad_count:2.748G err_count:0.000 nquad_speed:167.6k/sec edge_count:5.496G edge_speed:335.1k/sec
MAP 04h33m20s nquad_count:2.748G err_count:0.000 nquad_speed:167.6k/sec edge_count:5.496G edge_speed:335.1k/sec
Processing file (574 out of 2260): /home/kevin/prairiedog/outputs/samples/SRR2567096.fasta.rdf
MAP 04h33m21s nquad_count:2.748G err_count:0.000 nquad_speed:167.6k/sec edge_count:5.496G edge_speed:335.1k/sec
MAP 04h33m22s nquad_count:2.748G err_count:0.000 nquad_speed:167.6k/sec edge_count:5.497G edge_speed:335.1k/sec
MAP 04h33m23s nquad_count:2.749G err_count:0.000 nquad_speed:167.6k/sec edge_count:5.497G edge_speed:335.1k/sec
MAP 04h33m24s nquad_count:2.749G err_count:0.000 nquad_speed:167.6k/sec edge_count:5.497G edge_speed:335.1k/sec
MAP 04h33m25s nquad_count:2.749G err_count:0.000 nquad_speed:167.6k/sec edge_count:5.498G edge_speed:335.1k/sec
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

goroutine 799608 [runnable]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc051884090, 0x21, 0xc00fec55c0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:78
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

goroutine 799609 [runnable]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc0518840f0, 0x21, 0xc00fec5620)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:78
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

goroutine 799610 [runnable]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc051884150, 0x21, 0xc00fec5680)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:78
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

goroutine 799611 [runnable]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc0518841b0, 0x21, 0xc00fec56e0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:78
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

goroutine 799612 [runnable]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc051884210, 0x21, 0xc00fec5740)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:78
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

goroutine 799613 [runnable]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc051884270, 0x21, 0xc00fec57a0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:78
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

goroutine 799614 [syscall]:
syscall.Syscall(0x0, 0x2010, 0xc24dd52000, 0x4000, 0xc06aa36240, 0x0, 0x8de2c0)
        /usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.read(0x2010, 0xc24dd52000, 0x4000, 0x4000, 0xc06aa36201, 0x0, 0x0)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:732 +0x5a
syscall.Read(0x2010, 0xc24dd52000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall_unix.go:172 +0x49
internal/poll.(*FD).Read(0xc00fec5aa0, 0xc24dd52000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:165 +0x119
os.(*File).read(0xc00000e180, 0xc24dd52000, 0x4000, 0x4000, 0x7f23970c1260, 0x1, 0x10000c07554b658)
        /usr/local/go/src/os/file_unix.go:249 +0x4e
os.(*File).Read(0xc00000e180, 0xc24dd52000, 0x4000, 0x4000, 0xc07554b688, 0xc07554b688, 0x8c89a7)
        /usr/local/go/src/os/file.go:108 +0x69
bufio.(*Reader).fill(0xc00fec5b00)
        /usr/local/go/src/bufio/bufio.go:100 +0x10f
bufio.(*Reader).Peek(0xc00fec5b00, 0xa, 0x400, 0xc28c31f800, 0x400, 0x400, 0x0)
        /usr/local/go/src/bufio/bufio.go:132 +0x3f
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc0518842d0, 0x21, 0xc00fec5800)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:86 +0x36e
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

With Dgraph Zero running in Docker. We’re on a regular disk, not SSDs, so I’m wondering if that’s the issue? Cpu specs are:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       40 bits physical, 48 bits virtual
CPU(s):              24
On-line CPU(s) list: 0-23
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               44
Model name:          Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz
Stepping:            2
CPU MHz:             3333.482
CPU max MHz:         3068.0000
CPU min MHz:         1600.0000
BogoMIPS:            6133.61
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            12288K
NUMA node0 CPU(s):   0-5,12-17
NUMA node1 CPU(s):   6-11,18-23
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm pti tpr_shadow vnmi flexpriority ept vpid dtherm ida arat

and 94G of ram. The command we’re using is:

dgraph bulk -r outputs/samples/ -s kmers.schema --map_shards=1 --reduce_shards=1 --http localhost:8001 --zero=localhost:5080 -j 12

Any insight would be appreciated! Thanks in advance,
Kevin

Hi Kevin,

will you let us know the version of dgraph that you are running? You could run dgraph version and copy the output here.

Thanks

Certainly, thanks for the quick reply. Its:

Dgraph version   : v1.0.12
Commit SHA-1     : 60d9ef0a
Commit timestamp : 2019-03-05 17:59:30 -0800
Branch           : HEAD
Go version       : go1.11.5

Also, while I am looking into the issue, I was wondering if you could provide more logs. Specifically, I do not see the errors in the logs. I only see the stack trace of various goroutines.

Sounds good. I’ve only seen the stack trace outputted so far without any specific error, but maybe we’re losing it somehow (running in screen).

I’ll turn on logging for screen and re-run dgraph bulk; should hit the same issue in a few hours and have the results back to you by tomorrow at the latest.

Found it, thread exhaustion

badger 2019/07/16 11:00:48 INFO: Storing value log head: {Fid:3 Len:47 Offset:35967093}
badger 2019/07/17 04:40:50 INFO: Storing value log head: {Fid:4 Len:47 Offset:39920719}
badger 2019/07/17 04:40:53 INFO: Force compaction on level 0 done
badger 2019/07/17 04:40:54 INFO: All 0 tables opened in 0s
runtime: program exceeds 10000-thread limit
fatal error: thread exhaustion

runtime stack:
runtime.throw(0x147bfcd, 0x11)
        /usr/local/go/src/runtime/panic.go:608 +0x72
runtime.checkmcount()
        /usr/local/go/src/runtime/proc.go:598 +0xac
runtime.mcommoninit(0xc08d026e00)
        /usr/local/go/src/runtime/proc.go:616 +0x76
runtime.allocm(0xc000506f00, 0x0, 0xc0481f6e00)
        /usr/local/go/src/runtime/proc.go:1573 +0x14d
runtime.newm(0x0, 0xc000506f00)
        /usr/local/go/src/runtime/proc.go:1891 +0x39
runtime.startm(0xc000506f00, 0x1fa3500)
        /usr/local/go/src/runtime/proc.go:2066 +0x139
runtime.handoffp(0xc000506f00)
        /usr/local/go/src/runtime/proc.go:2093 +0x55
runtime.retake(0x10761283b13291, 0xd00000018)
        /usr/local/go/src/runtime/proc.go:4488 +0x139
runtime.sysmon()
        /usr/local/go/src/runtime/proc.go:4406 +0x22e
runtime.mstart1()
        /usr/local/go/src/runtime/proc.go:1282 +0xe6
runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1248 +0x6e

goroutine 1 [chan receive]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).run(0xc03cea00c0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:37 +0x1aa
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).reduceStage(0xc00068f830)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/loader.go:282 +0x117
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.run()
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/run.go:192 +0xbab
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.init.0.func1(0xc0001ac480, 0xc0004468f0, 0x0, 0xb)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/run.go:44 +0x52
github.com/dgraph-io/dgraph/vendor/github.com/spf13/cobra.(*Command).execute(0xc0001ac480, 0xc000446840, 0xb, 0xb, 0xc0001ac480, 0xc000446840)
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/spf13/cobra/command.go:702 +0x2d3
github.com/dgraph-io/dgraph/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x1efa740, 0xc, 0x0, 0x0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/spf13/cobra/command.go:783 +0x2dc
github.com/dgraph-io/dgraph/vendor/github.com/spf13/cobra.(*Command).Execute(0x1efa740, 0x8b60ad, 0x1f02928)
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/spf13/cobra/command.go:736 +0x2b
github.com/dgraph-io/dgraph/dgraph/cmd.Execute()
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:60 +0x3b
main.main()
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:33 +0x88

goroutine 5 [chan receive]:
github.com/dgraph-io/dgraph/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x1fa3400)
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/golang/glog/glog.go:882 +0x8b
created by github.com/dgraph-io/dgraph/vendor/github.com/golang/glog.init.0
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/golang/glog/glog.go:410 +0x203

goroutine 19 [syscall, 1071 minutes]:
os/signal.signal_recv(0x14c5980)
        /usr/local/go/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
        /usr/local/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 20 [chan receive]:
github.com/dgraph-io/dgraph/x.init.1.func1()
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/metrics.go:81 +0x7e
created by github.com/dgraph-io/dgraph/x.init.1
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/metrics.go:76 +0x3ce

goroutine 9 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000aacc0)
        /go/src/go.opencensus.io/stats/view/worker.go:147 +0xdd
created by go.opencensus.io/stats/view.init.0
        /go/src/go.opencensus.io/stats/view/worker.go:29 +0x57

goroutine 10 [chan receive]:
github.com/dgraph-io/dgraph/worker.(*rateLimiter).bleed(0x1fc0f6c)
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/proposal.go:61 +0x95
created by github.com/dgraph-io/dgraph/worker.init.0
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/proposal.go:46 +0x41

goroutine 107 [IO wait, 1071 minutes]:
internal/poll.runtime_pollWait(0x7f8e3908fe30, 0x72, 0x0)
        /usr/local/go/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc000442418, 0x72, 0xc0000aa300, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a
internal/poll.(*pollDesc).waitRead(0xc000442418, 0xffffffffffffff00, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc000442400, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:384 +0x1a0
net.(*netFD).accept(0xc000442400, 0xc000187500, 0x30, 0x30)
        /usr/local/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc00043a0b0, 0x50, 0x7f8e3d300440, 0x0)
        /usr/local/go/src/net/tcpsock_posix.go:139 +0x2e
net.(*TCPListener).AcceptTCP(0xc00043a0b0, 0x891008, 0x30, 0x13de580)
        /usr/local/go/src/net/tcpsock.go:247 +0x47
net/http.tcpKeepAliveListener.Accept(0xc00043a0b0, 0x13de580, 0xc0001662a0, 0x132bd40, 0x1ee30d0)
        /usr/local/go/src/net/http/server.go:3232 +0x2f
net/http.(*Server).Serve(0xc000030820, 0x15ac2a0, 0xc00043a0b0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:2826 +0x22f
net/http.(*Server).ListenAndServe(0xc000030820, 0xc000030820, 0x0)
        /usr/local/go/src/net/http/server.go:2764 +0xb6
net/http.ListenAndServe(0x7ffdce2891a1, 0xe, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:3004 +0x74
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.run.func1(0xc00041a0e0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/run.go:167 +0x4a
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.run
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/run.go:166 +0x908

goroutine 159 [select, 109 minutes]:
google.golang.org/grpc/transport.(*controlBuffer).get(0xc000440340, 0x1, 0x0, 0x0, 0x0, 0x0)
        /ext-go/1/src/google.golang.org/grpc/transport/controlbuf.go:300 +0x103
google.golang.org/grpc/transport.(*loopyWriter).run(0xc00014c120, 0x0, 0x0)
        /ext-go/1/src/google.golang.org/grpc/transport/controlbuf.go:387 +0x1ac
google.golang.org/grpc/transport.newHTTP2Client.func3(0xc0000e8480)
        /ext-go/1/src/google.golang.org/grpc/transport/http2_client.go:303 +0x7b
created by google.golang.org/grpc/transport.newHTTP2Client
        /ext-go/1/src/google.golang.org/grpc/transport/http2_client.go:301 +0xc3a

goroutine 108 [select, 1071 minutes]:
google.golang.org/grpc.(*ccResolverWrapper).watcher(0xc00068f3e0)
        /ext-go/1/src/google.golang.org/grpc/resolver_conn_wrapper.go:109 +0x142
created by google.golang.org/grpc.(*ccResolverWrapper).start
        /ext-go/1/src/google.golang.org/grpc/resolver_conn_wrapper.go:95 +0x3f

goroutine 109 [select, 1071 minutes]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc0001aaa80)
        /ext-go/1/src/google.golang.org/grpc/balancer_conn_wrappers.go:122 +0x10f
created by google.golang.org/grpc.newCCBalancerWrapper
        /ext-go/1/src/google.golang.org/grpc/balancer_conn_wrappers.go:113 +0x13d

goroutine 110 [select, 1071 minutes]:
google.golang.org/grpc.(*addrConn).transportMonitor(0xc000450000)
        /ext-go/1/src/google.golang.org/grpc/clientconn.go:1384 +0x1c1
google.golang.org/grpc.(*addrConn).connect.func1(0xc000450000)
        /ext-go/1/src/google.golang.org/grpc/clientconn.go:960 +0x1b2
created by google.golang.org/grpc.(*addrConn).connect
        /ext-go/1/src/google.golang.org/grpc/clientconn.go:951 +0xe1

goroutine 158 [IO wait, 109 minutes]:
internal/poll.runtime_pollWait(0x7f8e3908ff00, 0x72, 0xc000095bb8)
        /usr/local/go/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc000444918, 0x72, 0xffffffffffffff00, 0x159cfe0, 0x1f02dd0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a
internal/poll.(*pollDesc).waitRead(0xc000444918, 0xc000476000, 0x8000, 0x8000)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc000444900, 0xc000476000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:169 +0x179
net.(*netFD).Read(0xc000444900, 0xc000476000, 0x8000, 0x8000, 0x11, 0x0, 0x0)
        /usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc00043a0e8, 0xc000476000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:177 +0x68
bufio.(*Reader).Read(0xc000428e40, 0xc00047e038, 0x9, 0x9, 0x8de930, 0xc000616a80, 0x4)
        /usr/local/go/src/bufio/bufio.go:216 +0x22f
io.ReadAtLeast(0x159a9c0, 0xc000428e40, 0xc00047e038, 0x9, 0x9, 0x9, 0xc000095e08, 0x3, 0x0)
        /usr/local/go/src/io/io.go:310 +0x88
io.ReadFull(0x159a9c0, 0xc000428e40, 0xc00047e038, 0x9, 0x9, 0xc0000b8420, 0xc000095e50, 0x88ac54)
        /usr/local/go/src/io/io.go:329 +0x58
golang.org/x/net/http2.readFrameHeader(0xc00047e038, 0x9, 0x9, 0x159a9c0, 0xc000428e40, 0x0, 0x1007f8e00000000, 0xc000099370, 0xc000440350)
        /ext-go/1/src/golang.org/x/net/http2/frame.go:237 +0x7b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc00047e000, 0xc1b2a84200, 0xc1b2a84200, 0x0, 0x0)
        /ext-go/1/src/golang.org/x/net/http2/frame.go:492 +0xa3
google.golang.org/grpc/transport.(*http2Client).reader(0xc0000e8480)
        /ext-go/1/src/google.golang.org/grpc/transport/http2_client.go:1135 +0x12b
created by google.golang.org/grpc/transport.newHTTP2Client
        /ext-go/1/src/google.golang.org/grpc/transport/http2_client.go:270 +0xad7

goroutine 113 [select]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*progress).report(0xc0001ca000)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/progress.go:67 +0xc1
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.newLoader
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/loader.go:112 +0x634

goroutine 178 [chan receive]:
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y.(*WaterMark).process(0xc0001aad00)
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y/watermark.go:203 +0x22a
created by github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y.(*WaterMark).Init
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y/watermark.go:73 +0xbf

goroutine 179 [chan receive]:
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y.(*WaterMark).process(0xc0001aad40)
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y/watermark.go:203 +0x22a
created by github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y.(*WaterMark).Init
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y/watermark.go:73 +0xbf

goroutine 187 [chan send, 1071 minutes]:
github.com/dgraph-io/dgraph/xidmap.New.func1(0xc00044e000, 0xc0000e2000)
        /ext-go/1/src/github.com/dgraph-io/dgraph/xidmap/xidmap.go:117 +0x20e
created by github.com/dgraph-io/dgraph/xidmap.New
        /ext-go/1/src/github.com/dgraph-io/dgraph/xidmap/xidmap.go:106 +0x279

goroutine 775948 [semacquire]:
sync.runtime_Semacquire(0xc03cea01a8)
        /usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc03cea01a0)
        /usr/local/go/src/sync/waitgroup.go:130 +0x64
github.com/dgraph-io/dgraph/x.(*Throttle).Wait(0xc03cea01a0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/max_waitgroup.go:55 +0x2d
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run(0xc0a5082050)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:62 +0x1c4
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).reduceStage.func1(0xc00068f830, 0xc18647a120)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/loader.go:274 +0x58
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).reduceStage
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/loader.go:272 +0x84

goroutine 776147 [chan receive]:
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y.(*WaterMark).process(0xc1ea8d6000)
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y/watermark.go:203 +0x22a
created by github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y.(*WaterMark).Init
        /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/y/watermark.go:73 +0xbf

goroutine 770193 [syscall]:
syscall.Syscall(0x0, 0x1b9d, 0xc272dd6000, 0x4000, 0xc22d90f100, 0xc03cc2dce0, 0x8de2c0)
        /usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.read(0x1b9d, 0xc272dd6000, 0x4000, 0x4000, 0xc22d90f101, 0x0, 0x0)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:732 +0x5a
syscall.Read(0x1b9d, 0xc272dd6000, 0x4000, 0x4000, 0x8, 0xc03cc2dda0, 0xc1552b9140)
        /usr/local/go/src/syscall/syscall_unix.go:172 +0x49
internal/poll.(*FD).Read(0xc1552b90e0, 0xc272dd6000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:165 +0x119
os.(*File).read(0xc03502e6c0, 0xc272dd6000, 0x4000, 0x4000, 0x7f8cffdcc9e8, 0x1, 0x101000000000000)
        /usr/local/go/src/os/file_unix.go:249 +0x4e
os.(*File).Read(0xc03502e6c0, 0xc272dd6000, 0x4000, 0x4000, 0xc24014b470, 0xc03cc2de88, 0x8c89a7)
        /usr/local/go/src/os/file.go:108 +0x69
bufio.(*Reader).fill(0xc1552b9140)
        /usr/local/go/src/bufio/bufio.go:100 +0x10f
bufio.(*Reader).Peek(0xc1552b9140, 0xa, 0x400, 0xc102c3c000, 0x400, 0x400, 0x0)
        /usr/local/go/src/bufio/bufio.go:132 +0x3f
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc24014b470, 0x21, 0xc1d6bca4e0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:86 +0x36e
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

goroutine 785131 [syscall]:
syscall.Syscall(0x0, 0x2315, 0xc20e2ae000, 0x4000, 0xc03882b100, 0xc169c284e0, 0x8de2c0)
        /usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.read(0x2315, 0xc20e2ae000, 0x4000, 0x4000, 0xc03882b101, 0x0, 0x0)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:732 +0x5a
syscall.Read(0x2315, 0xc20e2ae000, 0x4000, 0x4000, 0x1, 0xc169c285a0, 0xc0daa993e0)
        /usr/local/go/src/syscall/syscall_unix.go:172 +0x49
internal/poll.(*FD).Read(0xc0daa99380, 0xc20e2ae000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:165 +0x119
os.(*File).read(0xc1b03821e0, 0xc20e2ae000, 0x4000, 0x4000, 0x7f8d964b6398, 0x1, 0x100000000000000)
        /usr/local/go/src/os/file_unix.go:249 +0x4e
os.(*File).Read(0xc1b03821e0, 0xc20e2ae000, 0x4000, 0x4000, 0xc0f6a6cab0, 0xc169c28688, 0x8c89a7)
        /usr/local/go/src/os/file.go:108 +0x69
bufio.(*Reader).fill(0xc0daa993e0)
        /usr/local/go/src/bufio/bufio.go:100 +0x10f
bufio.(*Reader).Peek(0xc0daa993e0, 0xa, 0x400, 0xc019703000, 0x400, 0x400, 0x0)
        /usr/local/go/src/bufio/bufio.go:132 +0x3f
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc0f6a6cab0, 0x21, 0xc1ce2f5ce0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:86 +0x36e
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

goroutine 696351 [syscall]:
syscall.Syscall(0x0, 0x35f, 0xc2afbbc000, 0x4000, 0xc16b6d7e40, 0x0, 0x0)
        /usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.read(0x35f, 0xc2afbbc000, 0x4000, 0x4000, 0xc16b6d7e01, 0x0, 0x0)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:732 +0x5a
syscall.Read(0x35f, 0xc2afbbc000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall_unix.go:172 +0x49
internal/poll.(*FD).Read(0xc24b8c87e0, 0xc2afbbc000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:165 +0x119
os.(*File).read(0xc095afa050, 0xc2afbbc000, 0x4000, 0x4000, 0x7f8dd533c310, 0x1, 0x100000000000000)
        /usr/local/go/src/os/file_unix.go:249 +0x4e
os.(*File).Read(0xc095afa050, 0xc2afbbc000, 0x4000, 0x4000, 0xc29deafb90, 0xc10eb55e88, 0x8c89a7)
        /usr/local/go/src/os/file.go:108 +0x69
bufio.(*Reader).fill(0xc24b8c8840)
        /usr/local/go/src/bufio/bufio.go:100 +0x10f
bufio.(*Reader).Peek(0xc24b8c8840, 0xa, 0x400, 0xc043176800, 0x400, 0x400, 0x0)
        /usr/local/go/src/bufio/bufio.go:132 +0x3f
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc29deafb90, 0x21, 0xc012d501e0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:86 +0x36e
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:53 +0x171

goroutine 785127 [syscall]:
syscall.Syscall(0x0, 0x2316, 0xc07f01a000, 0x4000, 0xc10b082b40, 0xc0420face0, 0x8de2c0)
        /usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.read(0x2316, 0xc07f01a000, 0x4000, 0x4000, 0xc10b082b01, 0x0, 0x0)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:732 +0x5a
syscall.Read(0x2316, 0xc07f01a000, 0x4000, 0x4000, 0x1, 0xc0420fada0, 0xc03180f080)
        /usr/local/go/src/syscall/syscall_unix.go:172 +0x49
internal/poll.(*FD).Read(0xc03180f020, 0xc07f01a000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:165 +0x119
os.(*File).read(0xc2181b0088, 0xc07f01a000, 0x4000, 0x4000, 0x7f8c22aa4b50, 0x1, 0x100000000000000)
        /usr/local/go/src/os/file_unix.go:249 +0x4e
os.(*File).Read(0xc2181b0088, 0xc07f01a000, 0x4000, 0x4000, 0xc0f6a6c930, 0xc0420fae88, 0x8c89a7)
        /usr/local/go/src/os/file.go:108 +0x69
bufio.(*Reader).fill(0xc03180f080)
        /usr/local/go/src/bufio/bufio.go:100 +0x10f
bufio.(*Reader).Peek(0xc03180f080, 0xa, 0x400, 0xc02e1c3c00, 0x400, 0x400, 0x0)
        /usr/local/go/src/bufio/bufio.go:132 +0x3f
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xc0f6a6c930, 0x21, 0xc1ce2f5b60)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:86 +0x36e
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1

Found this: About bulk load failed with 10000-thread limit

I added debug.SetMaxThreads(30000) and built dgraph from source. Running it now, will let you know.

Great, thanks for the update. Please keep us posted.

That thread setting fixed our issue. Thanks for the help Aman!

[04:54:50-0600] REDUCE 40h55m33s 99.98% edge_count:11.02G edge_speed:125.5k/sec plist_count:3.950M plist_speed:44.99/sec
[04:54:51-0600] REDUCE 40h55m34s 99.98% edge_count:11.02G edge_speed:125.5k/sec plist_count:3.964M plist_speed:45.15/sec
[04:54:52-0600] REDUCE 40h55m35s 99.98% edge_count:11.02G edge_speed:125.5k/sec plist_count:3.978M plist_speed:45.30/sec
[04:54:53-0600] REDUCE 40h55m36s 99.98% edge_count:11.02G edge_speed:125.5k/sec plist_count:3.994M plist_speed:45.48/sec
[04:54:54-0600] REDUCE 40h55m37s 99.99% edge_count:11.02G edge_speed:125.5k/sec plist_count:4.010M plist_speed:45.67/sec
[04:54:55-0600] REDUCE 40h55m38s 99.99% edge_count:11.02G edge_speed:125.5k/sec plist_count:4.027M plist_speed:45.86/sec
[04:54:56-0600] REDUCE 40h55m39s 99.99% edge_count:11.02G edge_speed:125.5k/sec plist_count:4.037M plist_speed:45.97/sec
[04:54:57-0600] REDUCE 40h55m40s 99.99% edge_count:11.02G edge_speed:125.5k/sec plist_count:4.052M plist_speed:46.14/sec
[04:54:58-0600] REDUCE 40h55m41s 99.99% edge_count:11.02G edge_speed:125.5k/sec plist_count:4.067M plist_speed:46.31/sec
[04:54:59-0600] REDUCE 40h55m42s 100.00% edge_count:11.02G edge_speed:125.5k/sec plist_count:4.083M plist_speed:46.49/sec
[04:55:00-0600] REDUCE 40h55m43s 100.00% edge_count:11.02G edge_speed:125.5k/sec plist_count:4.100M plist_speed:46.69/sec
[04:55:01-0600] REDUCE 40h55m44s 100.00% edge_count:11.02G edge_speed:125.5k/sec plist_count:4.116M plist_speed:46.87/sec
[04:55:02-0600] REDUCE 40h55m45s 100.00% edge_count:11.02G edge_speed:125.5k/sec plist_count:4.116M plist_speed:46.87/sec
Total: 40h55m45s
1 Like

Great, awsome! Good to hear.