[Help] How to extract the value of a specific field from HTML

QAWS12g · April 15, 2024, 03:21

The Java code is as follows:

    // Requires java.io.IOException, java.net.URI, java.net.URISyntaxException and java.net.http.*
    public void getUserId() throws URISyntaxException, IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        // Build a GET request for the user search page
        HttpRequest request = HttpRequest.newBuilder()
                .uri(new URI("https://www.nowcoder.com/search/user?query=EnofTaiPeople&type=user&searchType=%E9%A1%B6%E9%83%A8%E5%AF%BC%E8%88%AA%E6%A0%8F&subType=0"))
                .build();

        // Send the request and read the whole response body as a String
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        String html = response.body();
    }

The content of `html` is below. It exceeds the post length limit, so it is attached as a file:

wrapper.txt (44.9 KB)

I want to extract the value of `userId`, for example `"userId":497256985`. The relevant part is shown below; you can use Ctrl+F on the page to locate it.

</section><script>window.__INITIAL_STATE__={"prefetchData":{"1":{"userInfo":{}}},"store":{"app":{"userInfo":{},"refreshed":false,"scrolled":false},"creation":{"isTheMainState":false},"enterprise":{"enterpriseInfo":{},"enterpriseInfoNP":{},"enterpriseUser":{},"enterpriseQuestionCount":0,
"enterpriseId":"","falseCompanyId":"","companyParams":{},"enterpriseInterviewCount":0,
"enterpriseBeginnerGuideFlag":true,"timelineRed":null,
"_allCareerJobs":[],"_communityBrief":[],"jobList":[],"hrList":[],
"jobCondition":{"city":"","query":"","salary":null,"recruitType":0,"career":[],"page":1,"totalCount":0,"pageSize":20},"jobMenu":{},"abResultForPublish":"qyzy_publish_ab_show"
,"jobSearchProgressAB":"jobSearchProgress_AB_a"
,"jobScheduleCount":0,"evaluationCount":0,"salaryCount":0,"forbidLoginPopup":true,"currentUrl":""},"exam":{"job":{},"topicId":-1,
"gioBaseData":{},"miniAdPicUrl":""
,"isComplete":true,"isCompleteFinish":false,"isCompletePromise":{}},"interviewQuestion":{"interviewFilterObj":{}},"live":{"barrages":[],"needUpdateQaList":false,"hotValue":0,"liveStatus":-1,"drawStatus":false,"officialStatus":false,"companyStatus":false,"liveProcess":{"barrageVos":[],"process":[],"isDraw":false},
"currentNodeTiming":{"timing":0},"currentPlaybackTiming":0,"barrageKeywords":{"keywords":""},"drawDialogInfo":{"currentPrizeInfo":{},"dialogVisible":false},"currentUnderTab":{"tabName":""},"isfullScreen":false,"accountId":0,"accountType":2},"profile":{"isSelf":false,"profile":{},"followData":{"fansCount":0,"followCount":0,"likeCount":0,"visitorCount":0,
"blackCount":0,"shieldCount":0,"longContentCount":0,"momentCount":0},
"isNewExamTest":true,"isEnterprise":false,"isHaveLive":false,"enterpriseLivingId":0},"QuestionDetail":{"gioBaseData":{},"isFlod":false},"search":{"searchParams":{"query":"EnofTaiPeople","type":"user","searchType":"顶部导航栏","subType":0},"routerLoading":false,"logId":"14d6d830-fad0-11ee-88f0-8b1b27dd69b7","sessionId":"6699_1713148210483_7234","standardIds":[],"forbidLoginPopup":false,
"relationSearchData":[{"name":"华为","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},
"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"新凯来","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"美团","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},
"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"招银网络","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},
"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"腾讯","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"拼多多","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"阿里控股笔试","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},
"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"小米"
,"extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"vivo","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},"typeName":null,"colorPc":null,"background":null,"router":null}]},"terminal":{"logId":""},"testPaper":{"fullscreen":false,"darkMode":false,"questionsMap":{},"questions":[],"paperSummary":{},"currentQuestionIndex":0,"currentExamStatus":"Ready","timeCount":0,"pageSource":"公司真题","testReportData":{},"newEditorAbFlag":false}},"pinia":{"app":{"userInfo":{},"isLogin":false,"isBusinessModel":false,"refreshed":false,"scrolled":false},"abTest":{"enterpriseForbidLoginPopupAB":"getRegisterTest_ab_B","newInterviewQuestionAB":true,"newPublishFromSearchAB":false,"companyQuestionAB":false,"zhuanxianglianxiAB":false}},"app":{"240":{"pageList":[{},{"showLoading":false,"list":[{"rc_type":0,"entityDataId":0,"trackId":"2d3v582c3ims9rtrlku3l",
"title":null,"expandType":0,

"extraInfo":null,"userId":497256985,"nickname":"EnofTaiPeople",
"admin":false,"headImgUrl":"https:\u002F\u002Fstatic.nowcoder.com\u002Fhead\u002F1photo.jpg","gender":null,"headDecorateUrl":"","jobId":11003,"pcHeadDecorateUrl":null,
"honorLevel":2,"honorLevelName":"小白牛 Lv.1","honorLevelColor":"c278e7",
"workTime":"2026",


"educationInfo":"湖南省长沙市长郡中学","identityList":null,"jobName":"C++","followed":false,"enterpriseInfo":null}]}],
"total":1,"totalPage":1,"noData":false,
"hasInited":true,"page":1},"242":{"loginRegisterTestAb":"getRegisterTest_ab_A"}},
"path":"\u002Fsearch\u002Fuser",
"fullPath":"\u002Fsearch\u002Fuser?query=EnofTaiPeople&type=user&searchType=%E9%A1%B6%E9%83%A8%E5%AF%BC%E8%88%AA%E6%A0%8F&subType=0"};(function(){var s;(s=document.currentScript||document.scripts[document.scripts.length-1]).parentNode.removeChild(s);}());</script>

Any pointers would be appreciated.

This is only for learning Java, nothing else.
Note: I already asked Copilot and got nowhere.

MatsuzakaSato · April 15, 2024, 03:22

If this were Python, you could use BeautifulSoup4.

handsome · April 15, 2024, 03:23

Read it into a dictionary?

Kakarotto · April 15, 2024, 03:25

Java also has document-parsing libraries similar to Python's BeautifulSoup4.

dalong · April 15, 2024, 03:25

Go brute force: just extract it with a regex.

Kakarotto · April 15, 2024, 03:27

I looked through a crawler I wrote a while ago: add the jsoup dependency, read a few jsoup tutorials, and parse the value out.

baipiaodang · April 15, 2024, 03:28

Here to learn.

user135 · April 15, 2024, 03:28

Learn some XPath.

QAWS12g · April 15, 2024, 03:29

Could you walk me through it step by step? I have tried this before too.

manify · April 15, 2024, 03:29

Python's BeautifulSoup can usually parse this.

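dalong · April 15, 2024, 03:29

Also, what you have here is JavaScript that carries JSON. If you are not worried about the nesting changing, you can parse it directly.
A regex is simpler, though, and it keeps working however the nesting changes.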

ccqc · April 15, 2024, 03:31

Parse it with XPath.

QAWS12g · April 15, 2024, 03:31

OK, that is a new term for me. I will go study it later.

Kakarotto · April 15, 2024, 03:34

I looked at wrapper.txt and `"userId":497256985` appears only once, so you can extract it directly with a regex.

fykang · April 15, 2024, 03:34

A simple JSONPath expression can read it. Where the structure contains arrays, you need to write a loop to iterate over them:
$.app.240.pageList[1].list[0].userId

live · April 15, 2024, 03:36

I would just use a regex …

QAWS12g · April 15, 2024, 03:37

Thanks, everyone, the problem is solved (why didn't Copilot come up with an approach this quickly?).
I just used a regex directly:

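        // Requires java.util.regex.Matcher and java.util.regex.Pattern
        // Get the HTML content
        String html = response.body();

        // Define the regex: capture the digits that follow "userId":
        Pattern pattern = Pattern.compile("\"userId\":(\\d+),");

        // Create the Matcher object
        Matcher matcher = pattern.matcher(html);

        // Look for the first match and print the captured digits
        if (matcher.find()) {
            System.out.println("Matched text: " + matcher.group(1));
        } else {
            System.out.println("No match found.");
        }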

Matched text: 497256985
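
For anyone reusing this, here is a slightly more tolerant variant as a minimal sketch. The pattern above requires a comma right after the number, so it would miss a `userId` that happens to be the last key in its object. The class name `UserIdRegex` and the method `findUserIds` are made up for illustration and are not from the thread. It uses only `java.util.regex` and collects every match, which also makes it easy to check whether the key really appears only once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original post.
class UserIdRegex {
    // Allows optional whitespace around the colon and does not require a trailing comma.
    private static final Pattern USER_ID = Pattern.compile("\"userId\"\\s*:\\s*(\\d+)");

    // Returns every userId value found in the text, in order of appearance.
    static List<Long> findUserIds(String html) {
        List<Long> ids = new ArrayList<>();
        Matcher matcher = USER_ID.matcher(html);
        while (matcher.find()) {
            ids.add(Long.parseLong(matcher.group(1)));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Small sample shaped like the snippet in the first post.
        String sample = "{\"extraInfo\":null,\"userId\":497256985,\"nickname\":\"EnofTaiPeople\"}";
        System.out.println(findUserIds(sample)); // prints [497256985]
    }
}
```

If `findUserIds(html).size()` is greater than 1, the page contains several users and a plain regex can no longer tell which one you want; at that point parsing the JSON is the safer route.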

QAWS12g · April 15, 2024, 03:38

Learned something.

QAWS12g · April 15, 2024, 03:39

Another day of learning.

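tsuki · April 15, 2024, 03:42

This is a script tag followed by JSON data. In my experience, I would use a regex or string slicing to cut the JSON out, then parse it and extract the value. (You can also go brute force and grab the value directly with a regex.)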
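
The slicing step could look roughly like the sketch below. It assumes the page keeps the two marker strings visible in the snippet from the first post (`window.__INITIAL_STATE__=` before the JSON and `;(function(){` after it). The class name `InitialStateSlicer` and its method are hypothetical. Parsing the resulting text is left out because the JDK has no built-in JSON parser, so that step needs a JSON library of your choice.

```java
// Hypothetical helper class for illustration; not part of the original thread.
class InitialStateSlicer {
    // Marker strings taken from the snippet in the first post; the site may change them.
    private static final String START = "window.__INITIAL_STATE__=";
    private static final String END = ";(function(){";

    // Returns the JSON text between the two markers, or null if either marker is missing.
    static String sliceInitialState(String html) {
        int start = html.indexOf(START);
        if (start < 0) {
            return null;
        }
        start += START.length();
        // Search for the end marker only after the start marker.
        int end = html.indexOf(END, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String sample = "<script>window.__INITIAL_STATE__={\"userId\":497256985};(function(){})();</script>";
        System.out.println(sliceInitialState(sample)); // prints {"userId":497256985}
    }
}
```

One caveat: if a string value inside the JSON ever contained the end marker itself, this would cut the text short, so treat it as a quick approach rather than a robust parser.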
