[Help] How do I extract the value of a specific key from HTML?

My Java code is as follows:

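    import java.io.IOException;
    import java.net.URI;
    import java.net.URISyntaxException;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public void getUserId() throws URISyntaxException, IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(new URI("https://www.nowcoder.com/search/user?query=EnofTaiPeople&type=user&searchType=%E9%A1%B6%E9%83%A8%E5%AF%BC%E8%88%AA%E6%A0%8F&subType=0"))
                .build();

        // Send the GET request and read the whole response body as a String
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        String html = response.body();
    }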

Below is the HTML content. It exceeds the post length limit, so the full response is in the attached file:

wrapper.txt (44.9 KB)

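I want to extract the value of userId, i.e. the number in "userId":497256985. The relevant part of the page is shown below; you can find it in the page with Ctrl+F: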

</section><script>window.__INITIAL_STATE__={"prefetchData":{"1":{"userInfo":{}}},"store":{"app":{"userInfo":{},"refreshed":false,"scrolled":false},"creation":{"isTheMainState":false},"enterprise":{"enterpriseInfo":{},"enterpriseInfoNP":{},"enterpriseUser":{},"enterpriseQuestionCount":0,
"enterpriseId":"","falseCompanyId":"","companyParams":{},"enterpriseInterviewCount":0,
"enterpriseBeginnerGuideFlag":true,"timelineRed":null,
"_allCareerJobs":[],"_communityBrief":[],"jobList":[],"hrList":[],
"jobCondition":{"city":"","query":"","salary":null,"recruitType":0,"career":[],"page":1,"totalCount":0,"pageSize":20},"jobMenu":{},"abResultForPublish":"qyzy_publish_ab_show"
,"jobSearchProgressAB":"jobSearchProgress_AB_a"
,"jobScheduleCount":0,"evaluationCount":0,"salaryCount":0,"forbidLoginPopup":true,"currentUrl":""},"exam":{"job":{},"topicId":-1,
"gioBaseData":{},"miniAdPicUrl":""
,"isComplete":true,"isCompleteFinish":false,"isCompletePromise":{}},"interviewQuestion":{"interviewFilterObj":{}},"live":{"barrages":[],"needUpdateQaList":false,"hotValue":0,"liveStatus":-1,"drawStatus":false,"officialStatus":false,"companyStatus":false,"liveProcess":{"barrageVos":[],"process":[],"isDraw":false},
"currentNodeTiming":{"timing":0},"currentPlaybackTiming":0,"barrageKeywords":{"keywords":""},"drawDialogInfo":{"currentPrizeInfo":{},"dialogVisible":false},"currentUnderTab":{"tabName":""},"isfullScreen":false,"accountId":0,"accountType":2},"profile":{"isSelf":false,"profile":{},"followData":{"fansCount":0,"followCount":0,"likeCount":0,"visitorCount":0,
"blackCount":0,"shieldCount":0,"longContentCount":0,"momentCount":0},
"isNewExamTest":true,"isEnterprise":false,"isHaveLive":false,"enterpriseLivingId":0},"QuestionDetail":{"gioBaseData":{},"isFlod":false},"search":{"searchParams":{"query":"EnofTaiPeople","type":"user","searchType":"顶部导航栏","subType":0},"routerLoading":false,"logId":"14d6d830-fad0-11ee-88f0-8b1b27dd69b7","sessionId":"6699_1713148210483_7234","standardIds":[],"forbidLoginPopup":false,
"relationSearchData":[{"name":"华为","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},
"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"新凯来","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"美团","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},
"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"招银网络","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},
"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"腾讯","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"拼多多","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"阿里控股笔试","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},
"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"小米"
,"extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},"typeName":null,"colorPc":null,"background":null,"router":null},{"name":"vivo","extraInfo":{"trackID_var":"2d3v582d6g4ok2fgxhi02","dolphin_var":"1"},"typeName":null,"colorPc":null,"background":null,"router":null}]},"terminal":{"logId":""},"testPaper":{"fullscreen":false,"darkMode":false,"questionsMap":{},"questions":[],"paperSummary":{},"currentQuestionIndex":0,"currentExamStatus":"Ready","timeCount":0,"pageSource":"公司真题","testReportData":{},"newEditorAbFlag":false}},"pinia":{"app":{"userInfo":{},"isLogin":false,"isBusinessModel":false,"refreshed":false,"scrolled":false},"abTest":{"enterpriseForbidLoginPopupAB":"getRegisterTest_ab_B","newInterviewQuestionAB":true,"newPublishFromSearchAB":false,"companyQuestionAB":false,"zhuanxianglianxiAB":false}},"app":{"240":{"pageList":[{},{"showLoading":false,"list":[{"rc_type":0,"entityDataId":0,"trackId":"2d3v582c3ims9rtrlku3l",
"title":null,"expandType":0,

"extraInfo":null,"userId":497256985,"nickname":"EnofTaiPeople",
"admin":false,"headImgUrl":"https:\u002F\u002Fstatic.nowcoder.com\u002Fhead\u002F1photo.jpg","gender":null,"headDecorateUrl":"","jobId":11003,"pcHeadDecorateUrl":null,
"honorLevel":2,"honorLevelName":"小白牛 Lv.1","honorLevelColor":"c278e7",
"workTime":"2026",


"educationInfo":"湖南省长沙市长郡中学","identityList":null,"jobName":"C++","followed":false,"enterpriseInfo":null}]}],
"total":1,"totalPage":1,"noData":false,
"hasInited":true,"page":1},"242":{"loginRegisterTestAb":"getRegisterTest_ab_A"}},
"path":"\u002Fsearch\u002Fuser",
"fullPath":"\u002Fsearch\u002Fuser?query=EnofTaiPeople&type=user&searchType=%E9%A1%B6%E9%83%A8%E5%AF%BC%E8%88%AA%E6%A0%8F&subType=0"};(function(){var s;(s=document.currentScript||document.scripts[document.scripts.length-1]).parentNode.removeChild(s);}());</script>

Any pointers would be much appreciated :yum:

This is purely for learning Java, nothing else.
Note: I already asked Copilot and got nowhere :sweat_smile:

If this were Python, I'd use BeautifulSoup4.

Read it out as a dictionary?

Java also has document-parsing libraries similar to Python's BeautifulSoup4.

Go brute-force: just extract it with a regex.

I looked through a scraper I wrote a while back: add the jsoup dependency, read a few jsoup tutorials, and parse the value out.

Here to learn.

Go learn XPath.

Could you walk me through it step by step :smiling_face_with_three_hearts:? I have tried it before.

Python's BeautifulSoup can usually parse this.

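Besides, what you have here is JS carrying JSON. If you don't need to worry about the nesting changing, you can parse it directly.
A regex is simpler, though, and keeps working no matter how the nesting changes.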

Parse it with XPath.

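OK, that's a new term for me. I'll go read up on it in a bit.

I looked at wrapper.txt and "userId":497256985 appears only once, so you can extract it directly with a regex.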

A simple JSONPath expression can read it out. Where the data contains arrays, you need to write a loop to iterate over them:
$.app.240.pageList[1].list[0].userId
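
To make the path above concrete, here is a minimal sketch of the same walk in plain Java, including the loop over the arrays. It assumes the JSON has already been parsed into nested `Map`/`List` objects by a JSON library of your choice (not shown here); the class name `UserIdWalk` and the hand-built `sample()` tree are illustrative stand-ins, trimmed down from the posted snippet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class UserIdWalk {
    /** Collects every userId found under app -> "240" -> pageList[*] -> list[*]. */
    static List<Long> collectUserIds(Map<String, Object> root) {
        List<Long> ids = new ArrayList<>();
        Object app = root.get("app");
        if (!(app instanceof Map)) return ids;
        Object state = ((Map<?, ?>) app).get("240");
        if (!(state instanceof Map)) return ids;
        Object pageList = ((Map<?, ?>) state).get("pageList");
        if (!(pageList instanceof List)) return ids;

        for (Object page : (List<?>) pageList) {
            if (!(page instanceof Map)) continue;
            Object list = ((Map<?, ?>) page).get("list");
            if (!(list instanceof List)) continue; // e.g. pageList[0] is an empty object
            for (Object user : (List<?>) list) {
                if (!(user instanceof Map)) continue;
                Object id = ((Map<?, ?>) user).get("userId");
                if (id instanceof Number) ids.add(((Number) id).longValue());
            }
        }
        return ids;
    }

    /** Hand-built stand-in for the parsed JSON (assumption: a real parser would produce a similar tree). */
    static Map<String, Object> sample() {
        Map<String, Object> user = Map.of("userId", 497256985, "nickname", "EnofTaiPeople");
        Map<String, Object> emptyPage = Map.of();
        Map<String, Object> resultPage = Map.of("showLoading", false, "list", List.of(user));
        Map<String, Object> state = Map.of("pageList", List.of(emptyPage, resultPage), "total", 1);
        return Map.of("app", Map.of("240", state));
    }

    public static void main(String[] args) {
        System.out.println(collectUserIds(sample())); // prints [497256985]
    }
}
```

Looping instead of hard-coding `pageList[1].list[0]` means the code still works if the search returns several users or the result page moves to a different index.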

I'd just go with a regex …

Thanks, everyone, problem solved (why didn't Copilot come up with ideas this fast :clown_face:)
I just used a regex directly:

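        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        // Get the HTML content
        String html = response.body();

        // Define the regular expression: capture the digits after "userId":
        Pattern pattern = Pattern.compile("\"userId\":(\\d+),");

        // Create the Matcher object
        Matcher matcher = pattern.matcher(html);

        // Look for the first match and print the captured number
        if (matcher.find()) {
            System.out.println("Matched text: " + matcher.group(1));
        } else {
            System.out.println("No match found.");
        }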
Matched text: 497256985
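
As a possible variation on the regex above, the sketch below allows optional whitespace around the colon and does not require a trailing comma, so it would still match if `userId` were the last field of its object. The class name `UserIdRegex` and the helper `firstUserId` are made up for illustration, and the sample string is a trimmed stand-in for the real response body.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UserIdRegex {
    // "userId", optional whitespace, a colon, optional whitespace, then one or more digits.
    private static final Pattern USER_ID = Pattern.compile("\"userId\"\\s*:\\s*(\\d+)");

    /** Returns the first numeric userId in the text, or an empty OptionalLong if there is none. */
    static OptionalLong firstUserId(String html) {
        Matcher m = USER_ID.matcher(html);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Trimmed stand-in for response.body()
        String html = "<script>window.__INITIAL_STATE__={\"list\":[{\"userId\":497256985,"
                + "\"nickname\":\"EnofTaiPeople\"}]};</script>";
        OptionalLong id = firstUserId(html);
        if (id.isPresent()) {
            System.out.println("Matched text: " + id.getAsLong());
        } else {
            System.out.println("No match found.");
        }
    }
}
```

Returning an `OptionalLong` rather than printing inside the helper lets the caller decide what to do when the page contains no match.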

Learned something.

Another day of learning :grin:

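This is a script tag followed by JSON-formatted data. From experience, I would use a regex or string slicing to cut the JSON out, then parse it and pull out the value. (Or, more bluntly, grab the value directly with a regex.)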
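
The slicing step described above could look roughly like this. It is a sketch that assumes the page contains the marker `window.__INITIAL_STATE__=` exactly as in the posted snippet, immediately followed by a JSON object; the class name `InitialStateSlice` and the sample input are hypothetical. The returned string would then be handed to whatever JSON library you prefer.

```java
class InitialStateSlice {
    private static final String MARKER = "window.__INITIAL_STATE__=";

    /** Returns the JSON object assigned to window.__INITIAL_STATE__, or null if it cannot be found. */
    static String extractInitialState(String html) {
        int start = html.indexOf(MARKER);
        if (start < 0) return null;
        start += MARKER.length();
        if (start >= html.length() || html.charAt(start) != '{') return null;

        int depth = 0;            // current nesting level of { }
        boolean inString = false; // true while inside a JSON string literal
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;                  // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) return html.substring(start, i + 1);
            }
        }
        return null; // braces never balanced
    }

    public static void main(String[] args) {
        String html = "</section><script>window.__INITIAL_STATE__="
                + "{\"app\":{\"note\":\"a } in a string\",\"userId\":497256985}}"
                + ";(function(){})();</script>";
        // prints {"app":{"note":"a } in a string","userId":497256985}}
        System.out.println(extractInitialState(html));
    }
}
```

Counting braces while skipping string contents avoids depending on whatever script text happens to follow the object, at the cost of a few more lines than a plain `indexOf`/`substring` pair.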
