Java Jsoup extracting <td> tags


I need to collect all the languages from this site into a collection. Does anyone know how I can extract them without the "ISO-639 code" column?


I am using Jsoup to extract HTML tags. As far as I understand, for each <tr> tag I need to take the first <td> tag, but I don't know how to do that.

This code returns every cell, including the ISO-639 codes:

Elements elementObj = doc.select( "table" )
    .select( "tbody" )
    .select( "tr" )
    .select( "td" );

There are 2 best solutions below

Krystian G

First, select all the tr elements. Then iterate over them and take the first td of each row:

Elements rows = doc.select("table").select("tbody").select("tr");
List<String> languageList = new ArrayList<>();
for (Element row : rows) {
    // selectFirst returns null when a row has no <td> (e.g. a header row of <th> cells)
    Element firstTd = row.selectFirst("td");
    if (firstTd != null) {
        languageList.add(firstTd.text());
    }
}
System.out.println(languageList);
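
For anyone who wants to run the row-by-row approach as-is, here is a minimal self-contained sketch. It assumes jsoup is on the classpath; the class name `FirstColumnByRow`, the helper `firstCells`, and the sample table are made up for illustration and stand in for the real page.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstColumnByRow {

    // Returns the text of the first <td> in every table row that has one.
    static List<String> firstCells(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element row : doc.select("table tr")) {
            Element firstTd = row.selectFirst("td");
            if (firstTd != null) { // header rows contain only <th>, so skip them
                result.add(firstTd.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not the actual site.
        String html = "<table>"
                + "<tr><th>Language</th><th>ISO-639 code</th></tr>"
                + "<tr><td>Afrikaans</td><td>af</td></tr>"
                + "<tr><td>Albanian</td><td>sq</td></tr>"
                + "</table>";
        System.out.println(firstCells(html)); // prints [Afrikaans, Albanian]
    }
}
```

The null check matters because `selectFirst` returns null when nothing matches, which is what happens for a header row made of `<th>` cells.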
Jonathan Hedley

Building on Krystian's answer, here's a one-liner that uses the nth-child selector to pick the first column and the eachText() method, which collects the text of each selected element into a list. It's just syntactic sugar to make your code a bit simpler.

String html = """
    <table>
    <tr><th>Language</th><th>ISO Code</th></tr>
    <tr><td>Afrikaans</td><td>af</td></tr>
    <tr><td>Albanian</td><td>sq</td></tr>
    </table>
    """;

Document doc = Jsoup.parse(html);
List<String> languages = doc.select("tr td:nth-child(1)").eachText();

System.out.println(languages);

Gives:

[Afrikaans, Albanian]

Try jsoup is a good way to play around with different selectors.
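
The same selector idea extends to any column. Below is a rough sketch, assuming jsoup is on the classpath; `ColumnBySelector` and `columnTexts` are hypothetical names, and the table is sample data rather than the real page. The column index is 1-based because that is how `:nth-child` counts.

```java
import java.util.List;

import org.jsoup.Jsoup;

public class ColumnBySelector {

    // Returns the text of the <td> that is the column-th child of its row (1-based).
    static List<String> columnTexts(String html, int column) {
        return Jsoup.parse(html)
                .select("tr td:nth-child(" + column + ")")
                .eachText();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not the actual site.
        String html = "<table>"
                + "<tr><th>Language</th><th>ISO Code</th></tr>"
                + "<tr><td>Afrikaans</td><td>af</td></tr>"
                + "<tr><td>Albanian</td><td>sq</td></tr>"
                + "</table>";
        System.out.println(columnTexts(html, 1)); // prints [Afrikaans, Albanian]
        System.out.println(columnTexts(html, 2)); // prints [af, sq]
    }
}
```

Two things to keep in mind: the header row drops out only because its cells are `<th>`, not `<td>`, and `eachText()` leaves out elements that have no text, so empty cells will not appear in the list.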