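Scraping Quarterly Financial Reports with C# .NET Core | Building a Stock-Picking Strategy with Code (3)

Introduction

Last time we used .NET Core to fetch stock prices (see "Automatically Scraping Daily Taiwan Stock Prices with C# .NET Core").
Price data is enough to compute technical indicators, but picking stocks by a company's underlying health also needs fundamental data, which means financial reports.
So this time we will write a quarterly-report scraper, again in .NET Core, and keep comparing the experience with Python along the way.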

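Inspecting the Website

This time the data comes from the Market Observation Post System (公開資訊觀測站, MOPS).

Navigate to: Home (首頁) > Summary Reports (彙總報表) > Financial Statements (財務報表) > After IFRSs Adoption (採用IFRSs後) > Statement of Comprehensive Income (綜合損益表) / Balance Sheet (資產負債表)

The goal is to download both the statement of comprehensive income and the balance sheet.

Next, run a query on the site and watch the requests it sends.

The screenshot above uses the statement of comprehensive income as the example.
Comparing the two statements shows that both send exactly the same form data; the only difference is the URL.
The response is HTML made up of several tables, as shown below:

The columns can also differ from one industry to another,
so after some thought I decided to store each column as its own row, with a type field to tell the two statements apart:

Statement of comprehensive income (綜合損益表): type = 1
Balance sheet (資產負債表): type = 2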
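
To make the column-to-row idea concrete, here is a small sketch. The helper name `Unpivot`, the tuple shape, and the sample row are assumptions for illustration only (the revenue figure is made up); the real parser appears later in the article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UnpivotSketch
{
    // Hypothetical helper: turns one wide table row into one
    // (stock id, item, value, seq) record per column.
    public static IEnumerable<(string StockId, string Item, string Value, int Seq)> Unpivot(
        IReadOnlyList<string> headers, IReadOnlyList<string> cells, string keyHeader)
    {
        // Locate the column that identifies the company.
        int keyIndex = headers.ToList().IndexOf(keyHeader);
        if (keyIndex < 0 || keyIndex >= cells.Count)
            throw new ArgumentException($"Header '{keyHeader}' not found in the row.", nameof(keyHeader));

        // Emit one record per column; seq keeps the original column order.
        int count = Math.Min(headers.Count, cells.Count);
        for (int i = 0; i < count; i++)
            yield return (cells[keyIndex], headers[i], cells[i], i + 1);
    }
}
```

For example, calling `UnpivotSketch.Unpivot(new[] { "公司代號", "公司名稱", "營業收入" }, new[] { "2330", "台積電", "100" }, "公司代號")` yields three records, all carrying stock id "2330". Because every column becomes a row, tables with different column sets can share one database table.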

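Scraping the Report Data

First, define an enum with one member for the statement of comprehensive income and one for the balance sheet: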

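public class EnumModels
{
    // Which quarterly statement to fetch; the numeric value is what gets stored as "type".
    public enum SeasonReportType
    {
        綜合損益表 = 1, // statement of comprehensive income
        資產負債表 = 2  // balance sheet
    }
}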

Then map each enum value to its URL:

// Requires: using System;
public string GetSeasonReportUrl(EnumModels.SeasonReportType seasonReportType)
{
    switch (seasonReportType)
    {
        case EnumModels.SeasonReportType.綜合損益表:
            return "https://mops.twse.com.tw/mops/web/ajax_t163sb04";
        case EnumModels.SeasonReportType.資產負債表:
            return "https://mops.twse.com.tw/mops/web/ajax_t163sb05";
        default:
            // Guards against enum members added later without a URL.
            throw new ArgumentException($"No URL is defined for {seasonReportType}");
    }
}

The Scraper

The idea is to pass in a few parameters and get back the report for a given quarter:

// Requires: using System.Net.Http; using System.Text; using System.Threading.Tasks;
public async Task<string> ReadSeasonReportHtmlByTWSEAsync(EnumModels.SeasonReportType seasonReportType, int year, int season)
{
    using (var client = new HttpClient())
    {
        // Same form fields the browser sends; year is in the ROC calendar (e.g. 106).
        var body = $"encodeURIComponent=1&step=1&firstin=1&off=1&isQuery=Y&TYPEK=sii&year={year}&season={season}";
        // Send the body as form data; StringContent would otherwise default to text/plain.
        var content = new StringContent(body, Encoding.UTF8, "application/x-www-form-urlencoded");
        var response = await client.PostAsync(GetSeasonReportUrl(seasonReportType), content);
        var result = await response.Content.ReadAsStringAsync();
        if (response.StatusCode != System.Net.HttpStatusCode.OK)
            throw new HttpRequestException($"Unable to fetch the financial report: {response.StatusCode} {result}");
        return result;
    }
}

At this point, result holds the complete HTML of the report.
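
As a side note on the request body: `FormUrlEncodedContent` is another way to build the same form fields, and it sets the Content-Type header to `application/x-www-form-urlencoded` on its own. The sketch below is only an illustration; the class and method names (`FormBodySketch`, `BuildSeasonReportForm`) are made up for this example, and the field values are copied from the request above.

```csharp
using System.Collections.Generic;
using System.Net.Http;

public static class FormBodySketch
{
    // Hypothetical helper: builds the POST body for one quarter.
    // An array of pairs (rather than a Dictionary) keeps the field order explicit.
    public static FormUrlEncodedContent BuildSeasonReportForm(int year, int season)
    {
        return new FormUrlEncodedContent(new[]
        {
            new KeyValuePair<string, string>("encodeURIComponent", "1"),
            new KeyValuePair<string, string>("step", "1"),
            new KeyValuePair<string, string>("firstin", "1"),
            new KeyValuePair<string, string>("off", "1"),
            new KeyValuePair<string, string>("isQuery", "Y"),
            new KeyValuePair<string, string>("TYPEK", "sii"),
            new KeyValuePair<string, string>("year", year.ToString()),
            new KeyValuePair<string, string>("season", season.ToString())
        });
    }
}
```

Reading the content back with `await FormBodySketch.BuildSeasonReportForm(106, 1).ReadAsStringAsync()` gives `encodeURIComponent=1&step=1&firstin=1&off=1&isQuery=Y&TYPEK=sii&year=106&season=1`, which is handy for comparing against what the browser's network tab shows.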

Parsing the HTML

Next, the HTML has to be parsed. We will use the HtmlAgilityPack package for that.

Install the package:

dotnet add package HtmlAgilityPack

Define the SeasonReport model that Dapper will use:

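// Requires: using Dapper.Contrib.Extensions; (provides [Table] and [ExplicitKey])
[Table("SeasonReport")]
public class SeasonReport
{
    // The key is the combination of stock_id, year, season, item and type.
    [ExplicitKey]
    public string stock_id { get; set; }
    [ExplicitKey]
    public int year { get; set; }      // ROC calendar year
    [ExplicitKey]
    public int season { get; set; }    // quarter, 1 to 4
    [ExplicitKey]
    public string item { get; set; }   // original column name
    public string value { get; set; }  // cell text, stored as-is
    public int seq { get; set; }       // original column position, starting at 1
    [ExplicitKey]
    public int type { get; set; }      // 1 = comprehensive income, 2 = balance sheet
}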

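The goal is to turn the data into an IEnumerable<SeasonReport>.
First, take a look at the data we want to scrape:

Every table we care about has class=hasBorder,
so we can use that as the hook for selecting the data.
The code looks like this: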

// Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
public IEnumerable<SeasonReport> ParseSeasonReport(string html, int year, int season, EnumModels.SeasonReportType seasonReportType)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null (not an empty list) when nothing matches,
    // for example if the response contains no report tables.
    var tableNodes = doc.DocumentNode.SelectNodes("//table[@class=\"hasBorder\"]");
    if (tableNodes == null)
        yield break;
    foreach (var tableNode in tableNodes)
    {
        var trNodes = tableNode.SelectNodes("./tr");
        if (trNodes == null)
            continue;
        // Column names of the current table, in display order.
        var headers = new List<string>();
        foreach (var trNode in trNodes)
        {
            // Header row: remember the column names.
            var thNodes = trNode.SelectNodes("./th");
            if (thNodes != null)
                headers = thNodes.Select(th => th.InnerText.Trim()).ToList();
            // Data row: emit one SeasonReport per cell (column -> row).
            var tdNodes = trNode.SelectNodes("./td");
            if (tdNodes == null)
                continue;
            int stockIdIndex = headers.IndexOf("公司代號");
            if (stockIdIndex < 0 || stockIdIndex >= tdNodes.Count)
                continue;
            int cellCount = Math.Min(headers.Count, tdNodes.Count);
            for (int tdIndex = 0; tdIndex < cellCount; tdIndex++)
            {
                yield return new SeasonReport
                {
                    stock_id = tdNodes[stockIdIndex].InnerText.Trim(),
                    year = year,
                    season = season,
                    item = headers[tdIndex],
                    value = tdNodes[tdIndex].InnerText.Trim(),
                    seq = tdIndex + 1,
                    type = (int)seasonReportType
                };
            }
        }
    }
}

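That takes care of processing the data.

Scheduling & Saving to the Database

Next comes saving to the database, which I also want to run as a scheduled job.
Scheduling was already covered in "Automatically Scraping Daily Taiwan Stock Prices with C# .NET Core":
it uses the Coravel package, so I won't go over it again here.

First, wrap all of the steps above in a single method for the scheduler to call: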

public async Task ExecuteAsync(EnumModels.SeasonReportType seasonReportType, int year, int season)
{
    try
    {
        // Skip quarters that are already stored.
        if (_seasonReportRepository.IsExist(seasonReportType, year, season))
        {
            _logger.LogDebug($"{seasonReportType}, {year}, {season} data already exists");
            return;
        }
        _logger.LogInformation($"SeasonReportClawer Running, {seasonReportType}, year = {year}, season = {season}");
        var html = await ReadSeasonReportHtmlByTWSEAsync(seasonReportType, year, season);
        // Parsing is lazy; the rows are produced while Insert enumerates them.
        var reports = ParseSeasonReport(html, year, season, seasonReportType);
        _seasonReportRepository.Insert(reports);
    }
    catch (Exception ex)
    {
        // Pass the exception itself so the stack trace is logged too.
        _logger.LogWarning(ex, "SeasonReportClawer error");
    }
}

SeasonReportRepository.cs

// Requires: using System; using System.Collections.Generic; using System.Transactions;
//           using System.Data.SqlClient; (or Microsoft.Data.SqlClient, depending on the project)
//           using Dapper; using Dapper.Contrib.Extensions; using Microsoft.Extensions.Logging;
public class SeasonReportRepository
{
    private readonly SqlConnection _conn;
    private readonly ILogger<SeasonReportRepository> _logger;
    public SeasonReportRepository(ILogger<SeasonReportRepository> logger, SqlConnection conn)
    {
        _logger = logger;
        _conn = conn;
    }

    // Inserts one quarter's rows in a single transaction (5-minute timeout).
    public void Insert(IEnumerable<SeasonReport> seasonReportList)
    {
        try
        {
            using (var scope = new TransactionScope(TransactionScopeOption.Required, new TimeSpan(0, 5, 0)))
            {
                foreach (var seasonReport in seasonReportList)
                {
                    _conn.Insert(seasonReport);
                }
                scope.Complete();
            }
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to insert season reports");
        }
    }

    // Latest stored year (ROC calendar) for this statement; 106 when nothing is stored yet.
    public int GetMaxYear(EnumModels.SeasonReportType seasonReportType)
    {
        return _conn.ExecuteScalar<int>(
            "select isnull(max(year), 106) from SeasonReport where type=@type",
            new { type = (int)seasonReportType }
        );
    }

    // Latest stored season within the given year; 1 when nothing is stored yet.
    public int GetMaxSeason(EnumModels.SeasonReportType seasonReportType, int year)
    {
        return _conn.ExecuteScalar<int>(
            "select isnull(max(season), 1) from SeasonReport where year=@year and type=@type",
            new { year, type = (int)seasonReportType }
        );
    }

    // True when at least one row exists for this statement, year and season.
    public bool IsExist(EnumModels.SeasonReportType seasonReportType, int year, int season)
    {
        return _conn.ExecuteScalar<int>(
            "select count(1) from SeasonReport where year=@year and season=@season and type=@type",
            new
            {
                year,
                season,
                type = (int)seasonReportType
            }
        ) > 0;
    }
}

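I want to start from the first quarter of ROC year 106 (2017).
As the code above shows, the SQL quietly handles the starting point: isnull makes GetMaxYear and GetMaxSeason fall back to 106 and 1 when the table is empty.
With that in place, the scheduling part is easy: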

// Requires: using System; using System.Threading.Tasks; using Coravel.Invocable;
public class SeasonReportClawerSchedule : IInvocable
{
    private readonly SeasonReportClawer _seasonReportClawer;
    private readonly SeasonReportRepository _seasonReportRepository;
    public SeasonReportClawerSchedule(SeasonReportClawer seasonReportClawer, SeasonReportRepository seasonReportRepository)
    {
        _seasonReportClawer = seasonReportClawer;
        _seasonReportRepository = seasonReportRepository;
    }

    public async Task Invoke()
    {
        foreach (EnumModels.SeasonReportType seasonReportType in Enum.GetValues(typeof(EnumModels.SeasonReportType)))
        {
            // Resume from the latest quarter already stored for this statement.
            int year = _seasonReportRepository.GetMaxYear(seasonReportType);
            int season = _seasonReportRepository.GetMaxSeason(seasonReportType, year);
            // DateTime.Now.Year - 1911 converts the Gregorian year to the ROC year.
            while (year <= DateTime.Now.Year - 1911)
            {
                while (season <= 4)
                {
                    await _seasonReportClawer.ExecuteAsync(seasonReportType, year, season);
                    // Pause between requests without blocking the thread.
                    await Task.Delay(7000);
                    season++;
                }
                year++;
                season = 1;
            }
        }
    }
}
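
The nested loop above amounts to "walk every (year, season) pair from the last stored quarter up to the current ROC year". A minimal sketch of that idea in isolation, assuming hypothetical names (`QuarterSketch`, `ToRocYear`, `QuartersFrom`) that are not part of the actual project:

```csharp
using System;
using System.Collections.Generic;

public static class QuarterSketch
{
    // ROC (Minguo) year = Gregorian year - 1911, so 2017 becomes 106.
    public static int ToRocYear(int gregorianYear) => gregorianYear - 1911;

    // Enumerates (year, season) pairs from the starting quarter
    // through season 4 of lastYear, in chronological order.
    public static IEnumerable<(int Year, int Season)> QuartersFrom(int startYear, int startSeason, int lastYear)
    {
        int season = startSeason;
        for (int year = startYear; year <= lastYear; year++)
        {
            for (; season <= 4; season++)
                yield return (year, season);
            season = 1; // every later year starts again at the first quarter
        }
    }
}
```

For instance, `QuarterSketch.QuartersFrom(106, 3, 107)` produces (106, 3), (106, 4), (107, 1), (107, 2), (107, 3), (107, 4). Separating the enumeration from the fetching makes the resume logic easy to check without touching the network.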

Finally, register the schedule and specify when it should run:

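// Cron fields: minute hour day-of-month month day-of-week.
// "0 1 15 * *" runs at 01:00 on the 15th of every month.
scheduler
    .Schedule<SeasonReportClawerSchedule>()
    .Cron("0 1 15 * *");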

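Closing Thoughts

As usual, a quick comparison with Python to wrap up.
For someone just starting to program, Python probably has a slight edge, and there are more learning resources for it.
For experienced developers, I don't think the difference is very large; it mostly comes down to how well you know the language you are using.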
