Ideas for:HTML Tabl...
 
Notifications
Clear all

Ideas for:HTML Tables to spreadsheet. (Fun Side Project)

4 Posts
2 Users
0 Likes
533 Views
(@dpathan)
Posts: 28
Eminent Member
Topic starter
 

I had a data that came across during my work and analysis has been completed.

For a side project, I am trying to combine a html export script and excel to put data in time line. The data is messages for all conversation between date range.

Link to sample html file https://drive.google.com/file/d/1HDG9huedXaIhShqWMoKJGBCu3s3x-WEK/view?usp=sharing

As you can see that each message that was sent or received is formatted in HTML <table> tag. There are thousands of messages sent and received and each one of the message is in <table> tag. Further, the messages (see link example) between participants is grouped under <table> conversation</table>.

So far I am able to copy this data directly in excel and run a VBA macro (from stackexchange) to transpose the data in rows and columns.This method helped me accomplish the goal to finish the case.

However, I was thinking to automate this more and make it more dynamic. The VBA script is set to take the data in first three rows and place it in column(transpose) and continue until the last selected cell.

Link to VBA macro code https://drive.google.com/file/d/15Ad230todJILd7zFg1QQjM-LCtp0YsyN/view?usp=sharing.

If we can find a way to directly export the nested tables from html in to spreadsheet this would be more simpler. The current method only works if there are constant number of headings for each message such as Author, Time and Body. But it will fail if there are some messages in the conversation which has more than three headings for e.g. Author, Time, Body, Attachments.

Now we can set the macro to read either 3 headings or 4 or 5 and so on but it is not dynamic.

So far, I can see two ways to go forward with this
1) copy the data from html and improve the vba macro or
2) Write or Find or improvise a javascript or python to parse the html into excel.

I have also played around with TableExport javascript Link) but it is probably built to export simple tables and not nested such as the html example.

Having the conversation in spreadsheet would help creating timeline and analysis of multiple conversations. Once I have this spreadsheet I can experiment with gephi to create social graph based on persons unique id which is in the message.

Let me know if there are any other ideas or theories.

 
Posted : 02/10/2018 11:05 pm
jaclaz
(@jaclaz)
Posts: 5133
Illustrious Member
 

From the example you posted, you have a set of tables, that can be summed up as a .ini file 😯

<html>
<head>
</head><body>

<table><tr><th>Author</th><td>Person A (10001)<br /></td></tr></table>
<table><tr><th>Sent</th><td>2018-04-07 214239 UTC<br /></td></tr></table>
<table><tr><th>Body</th><td>hello<br /></td></tr></table><br />
<table><tr><th>Author</th><td>Person A (10001)<br /></td></tr></table>
<table><tr><th>Sent</th><td>2018-04-07 180812 UTC<br /></td></tr></table>
<table><tr><th>Body</th><td>hi there<br /></td></tr></table><br />
<table><tr><th>Author</th><td>Person B (20002)<br /></td></tr></table>
<table><tr><th>Sent</th><td>2018-04-07 180533 UTC<br /></td></tr></table>
<table><tr><th>Body</th><td> great work <br /></td></tr></table>
</body>
</html>

i.e.
Author=Person A (10001)
Sent=2018-04-07 214239 UTC
Body=hello

etc.

So, you could use Nirsoft HTMLAsText to obtain the second format (or use a TAB or comma as a separator).

Then, you could use a plain set of formulas (without any VBA) to index the fields and perform a VLookUp on the contents, given that the first field of each set is always the same (i.e. "Author").

I'll see if I can put together such a spreadsheet.

jaclaz

 
Posted : 03/10/2018 9:03 pm
jaclaz
(@jaclaz)
Posts: 5133
Illustrious Member
 

Try this
http//s000.tinyupload.com/index.php?file_id=03873845793535332111

jaclaz

 
Posted : 04/10/2018 8:50 am
(@dpathan)
Posts: 28
Eminent Member
Topic starter
 

Thanks. That works perfectly. This will be a great help to analyze gmail and FB data as they are formatted this way in html.

Now, I have to update it according to the data that comes in from html. Also in future, as a procedure, I have to verify the format of extracted data before I can start analysis based on this.

 
Posted : 04/10/2018 3:41 pm
Share: