How to extract an embedded link from an HTML document saved as text, or how to use xidel to extract the correct link?


I am on Windows and I use the "Git for Windows" tools in batch files. The code I extracted from the HTML page looks like this:

<a xmlns="http://www.w3.org/2000/svg" class="ZLl54 Dysyo" href="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d"><span class="o1DPKc">[ANNOUNCE] Git for Windows 2.41.0</span></div><div class="WzoK">Dear Git users, I hereby announce that Git for Windows 2.41.0 is available from: https://</div></a>

and I want to extract /g/git-for-windows/c/jgZ6P7bo7Fo with sed or awk. The first part, /g/git-for-windows/c/, is always the same, but the end of the URL differs.

What I did: sed 's/^.*\("./g/".*"><div\").*$/\1/' text.txt | tee text2.txt but it doesn't work.

What I want: to extract the topmost (always the latest) link to a new release of "Git for Windows" from the website https://groups.google.com/g/git-for-windows. The description of that link contains "ANNOUNCE". Here are my steps:

xidel https://groups.google.com/g/git-for-windows --printed-node-format html -e "//'Links:',//a" | tee text.txt

to get the website as text. Then I used cat text.txt | grep -F "announce" | head -1 | tee text1.txt. The result is the extracted code I posted above.

My questions: How do I use sed or awk correctly to extract the link /g/git-for-windows/c/jgZ6P7bo7Fo from the code? Or how can I use xidel in a better way, so that the results in the text file are easier to extract?

Thank you for your help.


Magoo On BEST ANSWER
@ECHO OFF
SETLOCAL
rem The following setting for the file is a name
rem that I use for testing and deliberately includes spaces to make sure
rem that the process works using such names. These will need to be changed to suit your situation.

SET "sourcedir=u:\your files"
SET "filename1=%sourcedir%\q76495893.txt"

SET "extracted="
FOR /f "usebackqdelims=" %%e IN ("%filename1%") DO (
 FOR %%o IN (%%e) DO (
  IF DEFINED extracted FOR /f "delims=<>" %%y IN ("%%o") DO SET "extracted=%%~y"&GOTO gotit
  IF "%%~o"=="href" SET "extracted=x"
 )
)
ECHO NOT found
GOTO :eof

:gotit
SET "extracted=%extracted:~1%"
ECHO extracted=%extracted%

GOTO :EOF

Since you tagged the post "batch", the above is a pure batch approach.

Read each line of the file into %%e. Use standard list processing of %%e to set %%o to each space-separated token in turn. When the href token is found, set extracted as a flag. When the next token arrives, tokenise on the redirectors (< and >) to grab the quoted string, assign it, minus the quotes, to extracted, and jump to :gotit.

Well, almost: the first character still has to be removed, because you want the string without the leading . (this is what %extracted:~1% does).

Renat On

This would work:

curl https://...  | grep -E -o ">\[ANNOUNCE.{0,800}" | grep ">\[ANNOUNCE.*href" | sed 's/<\/span.*href="\.\([^"]*\).*/ \1/'
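
If the matching line is already saved in text1.txt, as in the question, a single sed call may be enough. This is a minimal sketch, assuming the line has the shape shown in the question and contains one href that starts with ./g/git-for-windows/c/. The function name extract_href is hypothetical.

```shell
# Hypothetical helper: print the href value without its leading '.',
# taken from the first matching line of the file given as $1.
# Assumes the href starts with ./g/git-for-windows/c/ as in the question.
extract_href() {
  sed -n 's|.*href="\.\(/g/git-for-windows/c/[^"]*\)".*|\1|p' "$1" | head -n 1
}
```

It could be called as extract_href text1.txt. Using | as the sed delimiter avoids having to escape the slashes in the path, which is one of the problems in the original attempt.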
Compo On

Assuming you already have the shown string as the content of a file named text1.txt, a batch file could retrieve the required substring like this:

@Set /P "URL=" 0<"text1.txt"
@For /F Tokens^=2^ Delims^=^" %%G In ("%URL:*href=%") Do @Set "URL=%%~G"
@Echo %URL:~1%

How it works:

  1. Save the first line of text1.txt as the content of a variable named URL.
  2. Expand that variable, replacing everything up to and including the first instance of the string href with nothing. This leaves ="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d"><span class="o1DPKc">[ANNOUNCE] Git for Windows 2.41.0</span></div><div class="WzoK">Dear Git users, I hereby announce that Git for Windows 2.41.0 is available from: https://</div></a>. Then split it on double quotes and take the second token (= being the first). The result is the URL only (./g/git-for-windows/c/jgZ6P7bo7Fo), which overwrites the initial variable value.
  3. Expand the resulting variable skipping the first character, %URL:~1%.
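
The same three steps can be sketched in the POSIX shell that ships with Git for Windows, using parameter expansion instead of batch substitution. This is only an illustration, assuming the input line looks like the one in the question. The function name href_from_line is hypothetical.

```shell
# Hypothetical helper mirroring the three steps above.
# $1 is the line of HTML text; the extracted path is printed to stdout.
href_from_line() {
  line=$1
  rest=${line#*href=\"}      # drop everything up to and including the first href="
  url=${rest%%\"*}           # keep only the text before the next double quote
  printf '%s\n' "${url#.}"   # drop the leading '.' if there is one
}
```

It could be called as href_from_line "$(head -n 1 text1.txt)".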

If you just wanted the ending part, then:

@Set /P "URL=" 0<"text1.txt"
@For /F Tokens^=2^ Delims^=^" %%G In ("%URL:*href=%") Do @Echo %%~nxG
BeniBela On

You do not need to call so many tools. Everything can be selected with XPath alone:

  xidel https://groups.google.com/g/git-for-windows -e "//a[contains(., 'ANNOUNCE')]/@href"
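
The question asks for only the topmost link, without the leading dot. If the command above prints one relative href per line (an assumption about the output, not verified here), a small filter could keep just the first one. The function name first_path is hypothetical.

```shell
# Hypothetical filter: read hrefs from stdin, one per line,
# keep the first, and strip a leading '.' if present.
first_path() {
  head -n 1 | sed 's|^\.||'
}
```

It would be used by piping the xidel output into it, for example: xidel ... | first_path.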