Go: how does the concurrent crawler in this example work?

zhenyulu · Views: 70 · 2025-06-02 22:19:02 · Comments: 0

I am trying to solve the exercise of fetching URLs in parallel while using a cache to avoid fetching the same URL twice.
I found the accepted solution and can understand it: it uses a channel and goroutines to push URLs into the cache through the chan. But why doesn't my simpler code below work correctly?
I can't figure out where it goes wrong.

package main 
 
import ( 
    "fmt" 
    "sync" 
) 
 
type Fetcher interface { 
    // Fetch returns the body of URL and 
    // a slice of URLs found on that page. 
    Fetch(url string) (body string, urls []string, err error) 
} 
 
var cache = struct { 
    cache map[string]int 
    mux sync.Mutex 
}{cache: make(map[string]int)} 
 
// Crawl uses fetcher to recursively crawl 
// pages starting with url, to a maximum of depth. 
func Crawl(url string, depth int, fetcher Fetcher) { 
    // TODO: Fetch URLs in parallel. 
    // TODO: Don't fetch the same URL twice. 
    // This implementation doesn't do either: 
 
    if depth <= 0 { 
        return 
    } 
    cache.mux.Lock() 
    cache.cache[url] = 1 //put url in cache 
    cache.mux.Unlock() 
    body, urls, err := fetcher.Fetch(url) 
    if err != nil { 
        fmt.Println(err) 
        return 
    } 
 
    fmt.Printf("found: %s %q\n", url, body) 
    for _, u := range urls { 
        cache.mux.Lock() 
        if _, ok := cache.cache[u]; !ok { //check if url already in cache 
            cache.mux.Unlock() 
            go Crawl(u, depth-1, fetcher) 
        } else { 
            cache.mux.Unlock() 
        } 
    } 
    return 
} 
 
func main() { 
    Crawl("http://golang.org/", 4, fetcher) 
} 
 
// fakeFetcher is Fetcher that returns canned results. 
type fakeFetcher map[string]*fakeResult 
 
type fakeResult struct { 
    body string 
    urls []string 
} 
 
func (f fakeFetcher) Fetch(url string) (string, []string, error) { 
    if res, ok := f[url]; ok { 
        return res.body, res.urls, nil 
    } 
    return "", nil, fmt.Errorf("not found: %s", url) 
} 
 
// fetcher is a populated fakeFetcher. 
var fetcher = fakeFetcher{ 
    "http://golang.org/": &fakeResult{ 
        "The Go Programming Language", 
        []string{ 
            "http://golang.org/pkg/", 
            "http://golang.org/cmd/", 
        }, 
    }, 
    "http://golang.org/pkg/": &fakeResult{ 
        "Packages", 
        []string{ 
            "http://golang.org/", 
            "http://golang.org/cmd/", 
            "http://golang.org/pkg/fmt/", 
            "http://golang.org/pkg/os/", 
        }, 
    }, 
    "http://golang.org/pkg/fmt/": &fakeResult{ 
        "Package fmt", 
        []string{ 
            "http://golang.org/", 
            "http://golang.org/pkg/", 
        }, 
    }, 
    "http://golang.org/pkg/os/": &fakeResult{ 
        "Package os", 
        []string{ 
            "http://golang.org/", 
            "http://golang.org/pkg/", 
        }, 
    }, 
} 
 

Execution output:
found: http://golang.org/ "The Go Programming Language" 
 
Process finished with exit code 0 

When I start the goroutines, it looks like there is no recursion at all. But if I set a breakpoint on the line that checks whether the URL is in the cache, I get this:
found: http://golang.org/ "The Go Programming Language" 
found: http://golang.org/pkg/ "Packages" 
 
Debugger finished with exit code 0 

So the recursion does work, but something is going wrong; a race, I guess?
When I add a second breakpoint on the line that starts the goroutine, something even more interesting happens:
found: http://golang.org/ "The Go Programming Language" 
found: http://golang.org/pkg/ "Packages" 
fatal error: all goroutines are asleep - deadlock! 
 
goroutine 1 [semacquire]: 
sync.runtime_SemacquireMutex(0x58843c, 0x0, 0x1) 
        /usr/local/go/src/runtime/sema.go:71 +0x47 
sync.(*Mutex).lockSlow(0x588438) 
        /usr/local/go/src/sync/mutex.go:138 +0x295 
sync.(*Mutex).Lock(0x588438) 
        /usr/local/go/src/sync/mutex.go:81 +0x58 
main.Crawl(0x4e9cf9, 0x12, 0x4, 0x4f7700, 0xc00008c180) 
        /root/go/src/crwaler/main.go:38 +0x46c 
main.main() 
        /root/go/src/crwaler/main.go:48 +0x57 
 
goroutine 18 [semacquire]: 
sync.runtime_SemacquireMutex(0x58843c, 0x0, 0x1) 
        /usr/local/go/src/runtime/sema.go:71 +0x47 
sync.(*Mutex).lockSlow(0x588438) 
        /usr/local/go/src/sync/mutex.go:138 +0x295 
sync.(*Mutex).Lock(0x588438) 
        /usr/local/go/src/sync/mutex.go:81 +0x58 
main.Crawl(0x4ea989, 0x16, 0x3, 0x4f7700, 0xc00008c180) 
        /root/go/src/crwaler/main.go:38 +0x46c 
created by main.Crawl 
        /root/go/src/crwaler/main.go:41 +0x563 
 
Debugger finished with exit code 0 

Please refer to the following solution:

Your main() does not block until all the go Crawl() calls have finished, so it exits immediately. You can use a sync.WaitGroup or a channel to synchronize the program so that it only ends after all goroutines are done.
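A minimal sketch of the WaitGroup pattern in isolation (the function name runWorkers is illustrative, not from the original program): main blocks on wg.Wait() until every goroutine has called Done, so no work is lost when main returns.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWorkers starts n goroutines and blocks until all of them finish.
// Without the wg.Wait(), main could return first and some workers
// would never get a chance to run.
func runWorkers(n int) int64 {
	var done int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1) // register the worker before starting it
		go func() {
			defer wg.Done()
			atomic.AddInt64(&done, 1)
		}()
	}
	wg.Wait() // block until every goroutine has called Done
	return done
}

func main() {
	fmt.Println(runWorkers(5)) // all 5 workers are guaranteed to have run
}
```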

There is also a problem with the loop variable u being used inside the goroutine: by the time the goroutine actually runs, u may or may not have already advanced to the next value of the range loop.
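The capture pitfall can be sketched in isolation. Passing the loop variable to the goroutine as an argument copies it, which is safe in every Go version (before Go 1.22, the range variable was shared across iterations, so closing over it directly could observe a later value). The name capturedByParam is ours, for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// capturedByParam passes the loop variable to each goroutine as a
// parameter, so every goroutine works on its own copy of the value.
func capturedByParam() []int {
	ch := make(chan int, 3)
	for i := 0; i < 3; i++ {
		go func(n int) { ch <- n }(i) // n is a per-goroutine copy of i
	}
	out := make([]int, 0, 3)
	for j := 0; j < 3; j++ {
		out = append(out, <-ch) // one receive per goroutine started
	}
	sort.Ints(out) // goroutines finish in arbitrary order
	return out     // always [0 1 2]
}

func main() {
	fmt.Println(capturedByParam())
}
```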
The end of Crawl could look like this to solve both problems:

 
wg := sync.WaitGroup{} 
 
fmt.Printf("found: %s %q\n", url, body) 
for _, u := range urls { 
    cache.mux.Lock() 
    if _, ok := cache.cache[u]; !ok { //check if url already in cache 
        cache.mux.Unlock() 
        wg.Add(1) 
        go func(url string) { 
            Crawl(url, depth-1, fetcher) 
            wg.Done() 
        }(u) 
    } else { 
        cache.mux.Unlock() 
    } 
} 
 
// Block until all goroutines are done 
wg.Wait() 
 
return 
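The answer also mentions a channel as an alternative to sync.WaitGroup. A minimal sketch of that variant (waitWithChannel is an illustrative name, not part of the original program): the parent performs exactly one receive per goroutine it started, so it cannot return early.

```go
package main

import "fmt"

// waitWithChannel starts n goroutines and waits for all of them using
// a channel instead of a sync.WaitGroup.
func waitWithChannel(n int) int {
	done := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() {
			// ...real work, e.g. a recursive Crawl call, would go here...
			done <- struct{}{} // signal completion to the parent
		}()
	}
	finished := 0
	for i := 0; i < n; i++ {
		<-done // one receive per goroutine started
		finished++
	}
	return finished
}

func main() {
	fmt.Println(waitWithChannel(4))
}
```

A buffered channel (`make(chan struct{}, n)`) would also work and lets the goroutines finish without waiting for the parent's receives.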


Tags: multithreading