I am on Windows and I am using the "Git for Windows" tools in batch files. The code I extracted from the HTML site looks like this:
<a xmlns="http://www.w3.org/2000/svg" class="ZLl54 Dysyo" href="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d"><span class="o1DPKc">[ANNOUNCE] Git for Windows 2.41.0</span></div><div class="WzoK">Dear Git users, I hereby announce that Git for Windows 2.41.0 is available from: https://</div></a>
and I want to extract /g/git-for-windows/c/jgZ6P7bo7Fo with sed or awk. The first part is always the same /g/git-for-windows/c/ but the ending of the url part differs.
What I did:
sed 's/^.*\("./g/".*"><div\").*$/\1/' text.txt | tee text2.txt but it doesn't work.
What I want: I want to extract the uppermost (always the latest) link to a new release of "Git for Windows" from the website https://groups.google.com/g/git-for-windows. The description contains "Announce". Here are my steps:
xidel https://groups.google.com/g/git-for-windows --printed-node-format html -e "//'Links:',//a" | tee text.txt
to get the website as text.
Then I used cat text.txt | grep -F "announce" | head -1 | tee text1.txt.
The result is the extracted code I posted above.
My questions: How do I use sed or awk correctly to extract the link /g/git-for-windows/c/jgZ6P7bo7Fo from the code? Or how can I use xidel in a better way to get more easily extractable results in the text file?
Thank you for your help.
Since you tagged the post "batch", here is a batch approach.
Read the data from a file to %%e. Use standard list-processing of %%e to set %%o to each space-separated token in turn. When the href token is found, set extracted for use as a flag. When the next token arrives, use tokenising on the redirectors (< and >) to grab the quoted string, and assign that, minus the quotes, to extracted and done.

Well, almost. You still need to remove the first character, as you want the string minus the leading ".".
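Since the question also asked about sed, the same idea (find the href value, then drop the leading ".") can be sketched with sed. This is only a minimal sketch and not the batch method described above. It assumes a POSIX shell such as the Git Bash that ships with Git for Windows (quoting would differ in cmd.exe), that each link sits on one line, and that the path always starts with /g/git-for-windows/c/ as stated in the question. The function name extract_link and the shortened sample line are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read HTML lines on stdin and print the href path,
# without the leading ".", for every line holding a thread link.
# "|" is used as the sed delimiter so the slashes need no escaping.
extract_link() {
    sed -n 's|.*href="\.\(/g/git-for-windows/c/[^"]*\)".*|\1|p'
}

# Shortened copy of the line from the question (inner text elided).
sample='<a xmlns="http://www.w3.org/2000/svg" class="ZLl54 Dysyo" href="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d">...</div></a>'

printf '%s\n' "$sample" | extract_link
```

With a real file, the call would presumably look like extract_link < text1.txt. The -n option together with the p flag means lines without a matching href print nothing, so the result should contain only the path.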